Abstract



We study the design of efficient algorithms for combinatorial pattern matching. More concretely, we study 
algorithms for tree matching, string matching, and string matching in compressed texts. 

Tree Matching Survey We begin with a survey on tree matching problems for labeled trees based on 
deleting, inserting, and relabeling nodes. We review the known results for the tree edit distance problem, 
the tree alignment distance problem, and the tree inclusion problem. The survey covers both ordered and 
unordered trees. For each problem we present one or more of the central algorithms in detail.

Tree Inclusion Given rooted, ordered, and labeled trees P and T the tree inclusion problem is to determine
if P can be obtained from T by deleting nodes in T. We show that the tree inclusion problem can be solved
in $O(n_T)$ space with the following running times:

$$\min \begin{cases} O(l_P n_T), \\ O(n_P l_T \log\log n_T + n_T), \\ O\left(\frac{n_P n_T}{\log n_T} + n_T \log n_T\right). \end{cases}$$

Here $n_S$ and $l_S$ denote the number of nodes and leaves in tree $S \in \{P, T\}$, respectively, and we assume that
$n_P \le n_T$. Our results match or improve the previous time complexities while using only $O(n_T)$ space.
All previous algorithms required $\Omega(n_P n_T)$ space in the worst case.

Tree Path Subsequence Given rooted and labeled trees P and T the tree path subsequence problem is
to report which paths in P are subsequences of which paths in T. Here a path begins at the root and ends
at a leaf. We show that the tree path subsequence problem can be solved in $O(n_T)$ space with the following
running times:

$$\min \begin{cases} O(l_P n_T + n_P), \\ O(n_P l_T + n_T), \\ O\left(\frac{n_P n_T}{\log n_T} + n_T + n_P \log n_P\right). \end{cases}$$

As with our results for the tree inclusion problem, this matches or improves the previous time complexities while
using only $O(n_T)$ space. All previous algorithms required $\Omega(n_P n_T)$ space in the worst case.

Regular Expression Matching Using the Four Russian Technique Given a regular expression R and
a string Q the regular expression matching problem is to determine if Q matches any of the strings specified
by R. We give an algorithm for regular expression matching using $O(nm/\log n + n + m \log m)$ time and $O(n)$
space, where m and n are the lengths of R and Q, respectively. This matches the running time of the fastest
known algorithm for the problem while improving the space from $O(nm/\log n)$ to $O(n)$. Our algorithm
is based on the Four Russian Technique. We extend our ideas to improve the results for the approximate
regular expression matching problem, the string edit distance problem, and the subsequence indexing problem.


Regular Expression Matching Using Word-Level Parallelism We revisit the regular expression
matching problem and develop new algorithms based on word-level parallel techniques. On a RAM with a
standard instruction set and word length $w \ge \log n$, we show that the problem can be solved in $O(m)$ space
with the following running times:

$$\begin{cases} O\left(n \frac{m \log w}{w} + m \log w\right) & \text{if } m > w, \\ O(n \log m + m \log m) & \text{if } \sqrt{w} < m \le w, \\ O(\min(n + m^2, n \log m + m \log m)) & \text{if } m \le \sqrt{w}. \end{cases}$$

This improves the best known time bound among algorithms using $O(m)$ space. Whenever $w \ge \log^2 n$ it
improves all known time bounds regardless of how much space is used.

Approximate String Matching and Regular Expression Matching on Compressed Texts Given
strings P and Q and an error threshold k, the approximate string matching problem is to find all ending
positions of substrings in Q whose unit-cost string edit distance to P is at most k. The unit-cost string
edit distance is the minimum number of insertions, deletions, and substitutions needed to convert one string
to the other. We study the approximate string matching problem when Q is given in compressed form
using Ziv-Lempel compression schemes (more precisely, the ZL78 or ZLW schemes). We present a time-space
trade-off for the problem. In particular, we show that the problem can be solved in $O(nmk + occ)$ time and
$O(n/mk + m + occ)$ space, where n is the length of the compressed version of Q, m is the length of P, and
occ is the number of matches of P in Q. This matches the best known bound while improving the space
by a factor $\Theta(m^2 k^2)$. We extend our techniques to improve the results for regular expression matching on
Ziv-Lempel compressed strings.
Chapter 1

Introduction 



In this dissertation we study the design of efficient algorithms for combinatorial pattern matching. More 
concretely, we study algorithms for tree matching, string matching, and string matching in compressed 
strings. 

The dissertation consists of this introduction and the following (revised) papers. 

Chapter 2 A Survey on Tree Edit Distance and Related Problems. Philip Bille. Theoretical Computer 
Science, volume 337(1-3), 2005, pages 217-239. 

Chapter 3 The Tree Inclusion Problem: In Optimal Space and Faster. Philip Bille and Inge Li Gørtz. In
Proceedings of the 32nd International Colloquium on Automata, Languages and Programming, Lecture 
Notes in Computer Science, volume 3580, 2005, pages 66-77. 

Chapter 4 Matching Subsequences in Trees. Philip Bille and Inge Li Gørtz. In Proceedings of the 6th
Italian Conference on Algorithms and Complexity, Lecture Notes in Computer Science, volume 3998, 
2006, pages 248-259. 

Chapter 5 Fast and Compact Regular Expression Matching. Philip Bille and Martin Farach-Colton.
Submitted to a journal, 2005.

Chapter 6 New Algorithms for Regular Expression Matching. Philip Bille. In Proceedings of the 33rd 
International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer 
Science, volume 4051, 2006, pages 643-654. 

Chapter 7 Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel
Compressed Texts. Philip Bille, Rolf Fagerberg, and Inge Li Gørtz. In Proceedings of the 18th Annual
Symposium on Combinatorial Pattern Matching, 2007, to appear.

In addition to the above papers I have coauthored the following three papers during my PhD that are not
included in the dissertation:

• Labeling Schemes for Small Distances in Trees. Stephen Alstrup, Philip Bille, and Theis Rauhe. SIAM
Journal on Discrete Mathematics, volume 19(2), pages 448-462.

• From a 2D Shape to a String Structure using the Symmetry Set. Arjan Kuijper, Ole Fogh Olsen, Peter 
Giblin, Philip Bille, and Mads Nielsen. In Proceedings of the 8th European Conference on Computer 
Vision, Lecture Notes in Computer Science, Volume 3022, 2004, pages 313 - 325. 

• Matching 2D Shapes using their Symmetry Sets. Arjan Kuijper, Ole Fogh Olsen, Peter Giblin, and 
Philip Bille. In Proceedings of the 18th International Conference on Pattern Recognition, 2006, pages 
179-182. 






Of these three papers, the first studies compact distributed data structures for trees. The other two
are papers on image analysis related to our work on tree matching. The tree matching papers in the
dissertation and the related image analysis papers are all part of the EU-project "Deep Structure,
Singularities, and Computer Vision" that funded my studies. The project was a collaboration of 15 researchers
from Denmark, the United Kingdom, and the Netherlands working in Mathematics, Computer Vision, and
Algorithms. The overall objective of the project was to develop methods for matching images and shapes
based on multi-scale singularity trees and symmetry sets. The algorithms researchers (Stephen Alstrup,
Theis Rauhe, and myself) worked on algorithmic issues in tree matching problems.

1.1 Chapter Outline 

The remainder of this introduction is structured as follows. In Section 1.2 we define the model of computation. In
Section 1.3 we summarize our contributions for tree matching and their relationship to previous work. We
do the same for string matching and compressed string matching in Sections 1.4 and 1.5, respectively. In
Section 1.6 we give an overview of the central techniques used in this dissertation to achieve our results and
in Section 1.7 we conclude the introduction.

1.2 Computational Model 

Before presenting our work, we briefly define our model of computation. The Random Access Machine model
(RAM), formalized by Cook and Reckhow [CR72], captures many of the properties of a typical computer.
We will consider the word-RAM model variant as defined by Hagerup [Hag98]. Let w be a positive integer
parameter called the word length. The memory of the word-RAM is an infinite array of cells each capable
of storing a w-bit integer called a word. We adopt the usual assumption that $w \ge \log n$, where n is the size
of the input, i.e., an index or pointer to the input fits in a word. Most of the problems in this dissertation
are defined according to a set of characters or labels called an alphabet. We assume that each input element
from the alphabet is encoded as a w-bit integer in a word.

The instruction set includes operations on words such as addition, subtraction, bitwise shifting, bitwise
and, bitwise or, bitwise xor, multiplication, and division. Each operation can be computed in unit time.
The space complexity of an algorithm is the maximum number of cells used at any time besides the input,
which is considered read-only. The time to access a cell at index i is $O(\lceil (\log i)/w \rceil)$, i.e., the access time is
proportional to the number of words needed to write the index in binary. In particular, any data structure
of size $2^{O(w)}$ can be accessed in constant time. We will only encounter super-constant access time in our
discussion of the regular expression matching problem where very large data structures appear.

Word-RAM algorithms can be weakly non-uniform, that is, the algorithm has access to a fixed number
of word-size constants that depend on w. These constants may be thought of as being computed at "compile
time". For several of our results, we use a deterministic dictionary data structure of Hagerup et al. [HMP01]
that requires weak non-uniformity. However, in all cases our results can easily be converted to work without
weak non-uniformity (see Section 1.6.1 for details).

1.3 Tree Matching 

The problem of comparing trees occurs in areas as diverse as structured text databases (XML),
computational biology, compiler optimization, natural language processing, and image analysis [KTSK00, HO82,
KM95a, RR92, Tai79, ZS89]. For example, within computational biology the secondary structure of RNA
is naturally represented as a tree [Wat95, Gus97]. Comparing the secondary structure of RNA helps to
determine the functional similarities between these molecules.

In this dissertation we primarily consider comparing trees based on simple tree edit operations consisting
of deleting, inserting, and relabeling nodes. Based on these operations researchers have derived several
interesting problems such as the tree edit distance problem, the tree alignment distance problem, and the tree
inclusion problem. Chapter 2 contains a detailed survey of each of these problems. The survey covers both
ordered trees, with a left-to-right order among siblings, and unordered trees. For each problem one or more of
the central algorithms are presented in detail in order to illustrate the techniques and ideas used for solving
the problem.

The survey is presented in the original published form except for minor typographical corrections. How- 
ever, significant progress has been made on many of the problems since publication. To account for these, 
we give a brief introduction to each of the problems and discuss recent developments, focusing on our own 
contributions to the tree inclusion problem and the tree path subsequence problem. 

1.3.1 Tree Edit Operations 

Let T be a rooted tree. We call T a labeled tree if each node is assigned a symbol from a finite alphabet
$\Sigma$. We say that T is an ordered tree if a left-to-right order among siblings in T is given. If T is an ordered
tree the tree edit operations are defined as follows:

relabel Change the label of a node v in T. 

delete Delete a non-root node v in T with parent v', making the children of v become the children of v'.
The children are inserted in the place of v as a subsequence in the left-to-right order of the children of
v'.

insert The complement of delete. Insert a node v as a child of v' in T making v the parent of a consecutive
subsequence of the children of v'.

For unordered trees the operations can be defined similarly. In this case, the insert and delete operations
work on a subset instead of a subsequence. Figure 2.1 on page 20 illustrates the operations.
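
The three operations on ordered trees can be made concrete with a small sketch. The `Node` class, the function names, and the index-based addressing of children are assumptions made for illustration only and do not appear in the dissertation.

```python
class Node:
    """A node of a rooted, ordered, labeled tree (illustrative representation)."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])


def relabel(v, new_label):
    """Change the label of node v."""
    v.label = new_label


def delete(parent, i):
    """Delete the i-th child v of parent.

    The children of v take the place of v, keeping their left-to-right order.
    """
    v = parent.children[i]
    parent.children[i:i + 1] = v.children


def insert(parent, label, i, j):
    """Insert a new node as a child of parent (the complement of delete).

    The new node becomes the parent of the consecutive children
    parent.children[i:j]; with i == j it is inserted as a leaf.
    """
    v = Node(label, parent.children[i:j])
    parent.children[i:j] = [v]
    return v


def to_tuple(v):
    """Return a nested-tuple view of the tree, convenient for comparisons."""
    return (v.label, tuple(to_tuple(c) for c in v.children))
```

Deleting a node and then inserting a node with the same label over the same consecutive run of children restores the original tree, which is the sense in which insert is the complement of delete.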

1.3.2 Tree Edit Distance 

Let P and T be two rooted and labeled trees called the pattern and the target, respectively. The tree edit
distance between P and T is the minimum cost of transforming P to T by a sequence of tree edit operations
called an edit script. The cost of each tree edit operation is given by a metric cost function assigning a real
value to each operation depending on the labels of the nodes involved. The cost of a sequence of edit
operations is the sum of the costs of the operations in the sequence. The tree edit distance problem is to
compute the tree edit distance and a corresponding minimum cost edit script.

To state the complexities for the problem let $n_P$, $l_P$, $d_P$, and $i_P$ denote the number of nodes, number
of leaves, the maximum depth, and the maximum in-degree of P, respectively. Similarly, define $n_T$, $l_T$, $d_T$,
and $i_T$ for T. For simplicity in our bounds we will assume w.l.o.g. that $n_P \le n_T$.

The ordered version of the tree edit distance problem was originally introduced by Tai [Tai79], who
gave an algorithm using $O(n_P n_T l_P^2 l_T^2)$ time and space. In the worst case this is $O(n_P^3 n_T^3) = O(n_T^6)$. Zhang and
Shasha [ZS89] gave an improved algorithm using $O(n_P n_T \min(l_P, d_P) \min(l_T, d_T))$ time and $O(n_P n_T)$ space.
Note that in the worst case this is $O(n_P^2 n_T^2) = O(n_T^4)$ time. Klein [Kle98] showed how to improve the worst-case
running time to $O(n_P^2 n_T \log n_T) = O(n_T^3 \log n_T)$. The latter two algorithms are both based on dynamic
programming and may be viewed as different ways of computing a subset of the same dynamic programming
table. The basic dynamic programming idea is presented in Section 2.3.2.1 and a detailed presentation of
Zhang and Shasha's and Klein's algorithms is given in Sections 2.3.2.2 and 2.3.2.3.

Using fast matrix multiplication Chen [Che01] gave an algorithm using $O(n_P n_T + l_P^2 n_T + l_P^{2.5} l_T)$ time
and $O((n_P + l_P^2) \min(l_T, d_T) + n_T)$ space. In the worst case this algorithm runs in $O(n_P n_T^{2.5}) = O(n_T^{3.5})$ time.

In [DT05] Dulucq and Touzet introduced the concept of decomposition strategies as a framework for
algorithms based on the same type of dynamic program as [ZS89, Kle98]. They proved a lower bound of
$\Omega(n_P n_T \log n_P \log n_T)$ for any such strategy. Very recently, Demaine et al. [DMRW07] gave a new algorithm
for tree edit distance within the decomposition strategy framework. In the worst case this algorithm uses
$O(n_P^2 n_T (1 + \log \frac{n_T}{n_P})) = O(n_T^3)$ time and $O(n_P n_T)$ space. They also proved a matching worst-case lower
bound for all algorithms within the decomposition strategy framework.

An interesting special case of the problem is the unit-cost tree edit distance problem, where the goal is
to compute the number of edit operations needed to transform P to T. Inspired by techniques from string
matching [Ukk85b, LV89], Zhang and Shasha [SZ90] proposed an algorithm for the ordered unit-cost tree edit
distance problem. If u is the number of tree edit operations needed to transform P into T their algorithm
runs in $O(u^2 \min\{n_P, n_T\} \min\{l_P, l_T\})$ time. Hence, if the distance between P and T is small this algorithm
significantly improves the bounds for the general tree edit distance problem. In a recent paper, Akutsu et
al. [AFT06] gave an approximation algorithm for the unit-cost tree edit distance problem. They gave an
algorithm using $O(n_P n_T)$ time that approximates the unit-cost tree edit distance for bounded degree trees to
within a factor of $O(n_T^{3/4})$. The idea in their algorithm is to extract modified Euler strings (the sequence of
labels obtained by visiting the tree in a depth-first left-to-right order) and subsequently compute the string
edit distance (see Section 1.4.1) between these. This algorithm is based on earlier work on the relationship
between the unit-cost tree edit distance and string edit distance of the corresponding Euler strings [Aku06].

Zhang et al. [ZSS92] showed that the unordered tree edit distance problem (recast as a decision problem) 
is NP-complete even for binary trees with an alphabet of size 2. Later, Zhang and Jiang [ZJ94] showed that 
the problem is MAX-SNP hard. 

1.3.3 Constrained Tree Edit Distance 

Given that unordered tree edit distance is NP-complete and the algorithms for ordered tree edit distance are
not practical for large trees, several authors have proposed restricted forms and variations of the problem.
Selkow [Sel77] introduced the degree-1 edit distance, where insertions and deletions are restricted to the
leaves of the trees. Zhang et al. [Zha96b, ZWS96] introduced the degree-2 edit distance, where insertions and
deletions are restricted to nodes with zero or one child. Zhang [Zha95, Zha96a] introduced the constrained
edit distance that generalizes the degree-2 edit distance. Informally, constrained edit scripts must transform
disjoint subtrees to disjoint subtrees (see Section 2.3.4). In [Zha95, Zha96a] Zhang presented algorithms for
the constrained edit distance problem. For the ordered case he obtained $O(n_P n_T)$ time and for the unordered
case he obtained $O(n_P n_T (i_P + i_T) \log(i_P + i_T))$ time. Both use $O(n_P n_T)$ space. Richter [Ric97b] presented
an algorithm for the ordered version of the problem using $O(n_P n_T i_P i_T)$ time and $O(n_P d_T i_T)$ space. Hence, for
small degree and low depth trees this is a space improvement over Zhang's algorithm. Recently, Wang and
Zhang [WZ05] showed how to achieve $O(n_P n_T)$ time and $O(n_P \log n_T)$ space. The key idea is to process subtrees
of T according to a heavy-path decomposition of T (see Section 1.6.2).

For other variations and analysis of the tree edit distance problem see Section 2.3.5 and also the recent
work in [Tou03, DT03, GK05, Tou05, JP06].

1.3.4 Tree Alignment Distance 

An alignment of P and T is obtained by inserting specially labeled nodes (called spaces) into P and T so 
they become isomorphic when labels are ignored. The resulting trees are then overlayed on top of each other 
giving the alignment A. The cost of the alignment is the cost of all pairs of opposing labels in A and the 
optimal alignment is the alignment of minimum cost. The tree alignment distance problem is to compute a 
minimum cost alignment of P and T. 

For strings the alignment distance and edit distance are equivalent notions. More precisely, for any two 
strings A and B the edit distance between A and B equals the value of an optimal alignment of A and 
B [Gus97]. However, for trees edit distance and alignment distance can be different (see the discussion in 
Section 2.4). 

The tree alignment distance problem was introduced by Jiang et al. [JWZ95] who gave algorithms for
both the ordered and unordered version of the problem. For the ordered version they gave an algorithm
using $O(n_P n_T (i_P + i_T)^2)$ time and $O(n_P n_T (i_P + i_T))$ space. Hence, if P and T have small degrees this
algorithm outperforms the known algorithms for ordered tree edit distance. For the unordered version Jiang
et al. [JWZ95] show how to modify their algorithm such that it still runs in $O(n_P n_T)$ time for bounded
degree trees. On the other hand, if one of the trees is allowed to have arbitrary degree the problem becomes
MAX SNP-hard. Recall that the unordered tree edit distance problem is MAX SNP-hard even if both trees
have bounded degree. The algorithm by Jiang et al. [JWZ95] for ordered tree alignment distance is discussed
in detail in Section 2.4.1.1.

For similar trees Jansson and Lingas [JL03] presented a fast algorithm for ordered tree alignment. More
precisely, if an optimal alignment requires at most s spaces their algorithm computes the alignment in
$O((n_P + n_T) \log(n_P + n_T) (i_P + i_T)^3 s^2)$ time.¹ Their algorithm may be viewed as a generalization of the
fast algorithms for comparing similar sequences, see e.g., Section 3.3.4 in [SM97]. The recent techniques
for space-efficient computation of constrained edit distances of Wang and Zhang [WZ05] also apply to
alignment of trees. Specifically, Wang and Zhang gave an algorithm for the tree alignment distance problem
using $O(n_P n_T (i_P + i_T)^2)$ time and $O(n_P i_T \log n_T (i_P + i_T))$ space. Hence, they match the running time
of Jiang et al. [JWZ95] and whenever $i_T \log n_T = o(n_T)$ they improve the space. This result improves an
earlier space-efficient but slow algorithm by Wang and Zhao [WZ03].

Variations for more complicated cost functions for the tree alignment distance problem can be found 
in [HTGK03,JHS06]. 



1.3.5 Tree Inclusion 

The tree inclusion problem is defined as follows. We say that P is included in T if P can be obtained from 
T by deleting nodes in T. The tree inclusion problem is to determine if P can be included in T and if so 
report all subtrees of T that include P. 
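
To make the definition concrete for ordered trees, the following is a minimal sketch of a direct recursive test. It is not one of the algorithms discussed in this dissertation; it only decides inclusion (it does not report the including subtrees), and it makes no attempt at the time and space bounds studied here. The nested-tuple tree representation and the function names are assumptions made for illustration.

```python
from functools import lru_cache

# A tree is a pair (label, children), where children is a tuple of trees.
# A forest is a tuple of trees in left-to-right order.


@lru_cache(maxsize=None)
def forest_included(F, G):
    """Return True if forest F can be obtained from forest G by deleting nodes."""
    if not F:
        return True   # delete every node that is left in G
    if not G:
        return False  # nothing left in G to produce the nodes of F
    (g_label, g_children), g_rest = G[0], G[1:]
    # Case 1: delete the leftmost root of G; its children take its place.
    if forest_included(F, g_children + g_rest):
        return True
    # Case 2: keep the leftmost root of G as the leftmost root of F.
    (f_label, f_children), f_rest = F[0], F[1:]
    return (f_label == g_label
            and forest_included(f_children, g_children)
            and forest_included(f_rest, g_rest))


def tree_included(P, T):
    """Return True if the ordered tree P is included in the ordered tree T."""
    return forest_included((P,), (T,))
```

The recursion is complete because only the leftmost root of F can be matched to the leftmost root of G: any other node of F has an ancestor or a left relative that would need an image preceding that root, and no such node exists in G.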

The tree inclusion problem has recently been recognized as a query primitive for XML databases, 
see [SM02,YLH03,YLH04,ZADR03,SN00,TRS02]. The basic idea is that an XML database can be viewed 
as a labeled and ordered tree, such that queries correspond to solving a tree inclusion problem (see Figure 3.1 
on page 40). 

The tree inclusion problem was introduced by Knuth [Knu69, exercise 2.3.2-22] who gave a sufficient
condition for testing inclusion. Kilpeläinen and Mannila [KM95a] studied both the ordered and unordered
version of the problem. For unordered trees they showed that the problem is NP-complete. The same
result was obtained independently by Matoušek and Thomas [MT92]. For ordered trees Kilpeläinen and
Mannila [KM95a] gave a simple dynamic programming algorithm using $O(n_P n_T)$ time and space. This
algorithm is presented in detail in Section 2.5.2.1.

Several authors have improved the original dynamic programming algorithm. Kilpeläinen [Kil92] gave a
more space efficient version of the above algorithm using $O(n_P d_T)$ space. Richter [Ric97a] gave an algorithm
using $O(\sigma_P n_T + m_{P,T} d_T)$ time, where $\sigma_P$ is the size of the alphabet of the labels in P and $m_{P,T}$ is the number of
matches, defined as the number of pairs of nodes in P and T that have the same label. Hence, if the number
of matches is small the time complexity of this algorithm improves the $O(n_P n_T)$ time bound. The space
complexity of the algorithm is $O(\sigma_P n_T + m_{P,T})$. Chen [Che98] presented a more complex algorithm using
$O(l_P n_T)$ time and $O(l_P \min(d_T, l_T))$ space. Notice that the time and space complexity is still $O(n_P n_T)$
in the worst case.

A variation of the problem was studied by Valiente [Val05] and Alonso and Schott [AS01] gave an efficient 
average case algorithm. 



Our Results and Techniques In Chapter 3 we give three new algorithms for the tree inclusion problem
that together improve all the previous time and space bounds. More precisely, we show that the tree inclusion
problem can be solved in $O(n_T)$ space with the following running time (Theorem 5):

$$\min \begin{cases} O(l_P n_T), \\ O(n_P l_T \log\log n_T + n_T), \\ O\left(\frac{n_P n_T}{\log n_T} + n_T \log n_T\right). \end{cases}$$

Hence, when either P or T has few leaves we obtain fast algorithms. When both trees have many leaves
and $n_P = \Omega(\log^2 n_T)$, we instead improve the previous quadratic time bound by a logarithmic factor. In
particular, we significantly improve the space bounds, which in practical situations is a likely bottleneck.



¹ Note that the result reported in Chapter 2 is the slightly weaker bound from the conference version of their paper [JL01].

Our new algorithms are based on a different approach than the previous dynamic programming
algorithms. The key idea is to construct a data structure on T supporting a small number of procedures, called
the set procedures, on subsets of nodes of T. We show that any such data structure implies an algorithm for
the tree inclusion problem. We consider various implementations of this data structure all of which use linear
space. The first one gives an algorithm with $O(l_P n_T)$ running time. Secondly, we show that the running
time depends on a well-studied problem known as the tree color problem. We give a connection between the
tree color problem and the tree inclusion problem and using a data structure of Dietz [Die89] we immediately
obtain an algorithm with $O(n_P l_T \log\log n_T + n_T)$ running time (see also Section 1.6.1).

Based on the simple algorithms above we show how to improve the worst-case running time of the
set procedures by a logarithmic factor. The general idea is to divide T into small trees called clusters of
logarithmic size, each of which overlaps with other clusters on at most 2 nodes. Each cluster is represented
by a constant number of nodes in a macro tree. The nodes in the macro tree are then connected according to
the overlap of the clusters they represent. We show how to efficiently preprocess the clusters and the macro
tree such that the set procedures use constant time for each cluster. Hence, the worst-case quadratic running
time is improved by a logarithmic factor (see also Section 1.6.2).

1.3.6 Tree Path Subsequence 

In Chapter 4 we study the tree path subsequence problem defined as follows. Given two sequences of labeled 
nodes p and t, we say that p is a subsequence of t if p can be obtained by removing nodes from t. Given 
two rooted, labeled trees P and T the tree path subsequence problem is to determine which paths in P are 
subsequences of which paths in T. Here a path begins at the root and ends at a leaf. That is, for each path 
p in P, we must report all paths t in T such that p is a subsequence of t. 

In the tree path subsequence problem each path is considered individually, in the sense that removing a
node from a path does not affect any of the other paths that the node lies on. This should be seen in contrast
to the tree inclusion problem where each node deletion affects all of these paths. By definition, tree path
subsequence does not fit into the tree edit operations framework, and whether or not the trees are ordered does
not matter as long as the paths can be uniquely identified.

A necessary condition for P to be included in T is that all paths in P are subsequences of paths in T. 
As we will see shortly, the tree path subsequence problem can be solved in polynomial time and therefore 
we can use algorithms for tree path subsequence as a fast heuristic for unordered tree inclusion (recall that 
unordered tree inclusion is NP-complete). Section 4.1.1 contains a detailed discussion of applications. 

Tree path subsequence can be solved trivially in polynomial time using basic techniques. Given two
strings (or labeled paths) a and b, it is straightforward to determine if a is a subsequence of b in $O(|a| + |b|)$
time. It follows that we can solve tree path subsequence in worst-case $O(n_P n_T (n_P + n_T))$ time. Alternatively,
Baeza-Yates [BY91] gave a data structure using $O(|b| \log |b|)$ preprocessing time such that testing whether a is
a subsequence of b can be done in $O(|a| \log |b|)$ time. Using this data structure on each path in T we obtain a
solution to the tree path subsequence problem using 0(n^\ognT + nplognx) time. The data structure 
for subsequences can be improved as discussed in Section 1.4.4. However, a specialized and more efficient
solution was discovered by Chen [Che00] who showed how to solve the tree path subsequence problem in
$O(\min(l_P n_T + n_P, n_P l_T + n_T))$ time and $O(l_P d_T + n_P + n_T)$ space. Note that in the worst case this is $\Omega(n_P n_T)$
time and space.
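
The trivial solution described above can be sketched as follows: enumerate the root-to-leaf paths of both trees and test every pair with a greedy linear scan. This is only the baseline, not the algorithm of Chen or the algorithms of Chapter 4. The nested-tuple tree representation and the function names are assumptions made for illustration.

```python
def root_to_leaf_paths(tree):
    """Yield the label sequence of every root-to-leaf path.

    A tree is a pair (label, children), where children is a tuple of trees.
    """
    label, children = tree
    if not children:
        yield (label,)
    for child in children:
        for path in root_to_leaf_paths(child):
            yield (label,) + path


def is_subsequence(a, b):
    """Greedy left-to-right scan of b, using O(|a| + |b|) time."""
    remaining = iter(b)
    # "x in remaining" advances the iterator past the first occurrence of x.
    return all(x in remaining for x in a)


def tree_path_subsequence(P, T):
    """Return all pairs (p, t) of root-to-leaf paths with p a subsequence of t."""
    t_paths = list(root_to_leaf_paths(T))
    return [(p, t)
            for p in root_to_leaf_paths(P)
            for t in t_paths
            if is_subsequence(p, t)]
```

Because each path is materialized explicitly, this sketch uses space proportional to the total length of all paths, which is exactly the kind of overhead the linear-space results avoid.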

Our Results and Techniques In Chapter 4 we give three new algorithms for the tree path subsequence
problem improving the previous time and space bounds. Concretely, we show that the problem can be solved
in $O(n_T)$ space with the following time complexity (Theorem 9):

$$\min \begin{cases} O(l_P n_T + n_P), \\ O(n_P l_T + n_T), \\ O\left(\frac{n_P n_T}{\log n_T} + n_T + n_P \log n_P\right). \end{cases}$$

The first two bounds in Theorem 9 match the previous time bounds while improving the space to linear. The
latter bound improves the worst-case $O(n_P n_T)$ running time whenever $\log n_P = O(n_T / \log n_T)$. Note that,
in the worst case, the number of pairs consisting of a path from P and a path from T is $\Omega(n_P n_T)$, and therefore
we need at least as many bits to report the solution to the tree path subsequence problem. Hence, on a RAM
with logarithmic word size our worst-case bound is optimal.

The first two bounds are achieved using an algorithm that resembles the algorithm of Chen [Che00].
At a high level, the algorithms are essentially identical and therefore the bounds should be regarded as an
improved analysis of Chen's algorithm. The latter bound is achieved using an entirely new algorithm that
improves the worst-case $O(n_P n_T)$ time. Specifically, whenever $\log n_P = O(n_T / \log n_T)$ the running time is
improved by a logarithmic factor.

Our results are based on a simple framework for solving tree path subsequence. The main idea is to
traverse T while maintaining a subset of nodes in P, called the state. When reaching a leaf z in T the state
represents the paths in P that are subsequences of the path from the root to z. At each step the state
is updated using a simple procedure defined on subsets of nodes. The result of Theorem 9 is obtained by
taking the best of two algorithms based on our framework: The first one uses a simple data structure to
maintain the state. This leads to an algorithm using $O(\min(l_P n_T + n_P, n_P l_T + n_T))$ time. At a high level
this algorithm resembles the algorithm of Chen [Che00] and achieves the same running time. However, we
improve the analysis of the algorithm and show a space bound of $O(n_T)$. Our second algorithm combines
several techniques. Starting with a simple quadratic time and space algorithm, we show how to reduce the
space to $O(n_P \log n_T)$ using a heavy-path decomposition of T. We then divide P into small subtrees of size
$O(\log n_T)$ called micro trees. The micro trees are then preprocessed such that subsets of nodes in a micro
tree can be maintained in constant time and space leading to a logarithmic improvement of the time and
space bound (see also Section 1.6.2).

1.4 String Matching 

String matching is a classical core area within theoretical and practical algorithms, with numerous
applications in areas such as computational biology, search engines, data compression, and compilers, see [Gus97].

In this dissertation we consider the string edit distance problem, approximate string matching problem,
regular expression matching problem, approximate regular expression matching problem, and the
subsequence indexing problem. In the following sections we present the known results and our contributions for
each of these problems.

1.4.1 String Edit Distance and Approximate String Matching 

Let P and Q be two strings. The string edit distance between P and Q is the minimum cost of transforming
P to Q by a sequence of insertions, deletions, and substitutions of characters called the edit script. The cost
of each edit operation is given by a metric cost function. The string edit distance problem is to compute the
string edit distance between P and Q and a corresponding minimum cost edit script. Note that the string
edit distance is identical to the tree edit distance if the trees are paths.

The string edit distance problem has numerous applications. For instance, algorithms for it and its
variants are widely used within computational biology to search for gene sequences in biological databases.
Implementations are available in the popular Basic Local Alignment Search Tool (BLAST) [AGM+90].

To state the complexities for the problem, let m and n be the lengths of P and Q, respectively, and assume
w.l.o.g. that $m \le n$. The standard textbook solution to the problem, due to Wagner and Fischer [WF74],
fills in an $(m+1) \times (n+1)$ size distance matrix D such that $D_{i,j}$ is the edit distance between the ith prefix of
P and the jth prefix of Q. Hence, the string edit distance between P and Q can be found in $D_{m,n}$. Using
dynamic programming each entry in D can be computed in constant time leading to an algorithm using
$O(mn)$ time and space. Using a classic divide and conquer technique of Hirschberg [Hir75] the space can be
improved to $O(m)$. More details of the dynamic programming algorithm can be found in Section 5.5.

For general cost functions Crochemore et al. [CLZU03] recently improved the running time for the string
edit distance problem to $O(nm/\log m + n)$ time and space for a constant sized alphabet. The result is
achieved using a partition of the distance matrix based on a Ziv-Lempel factoring [ZL78] of the strings.
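As a minimal sketch of the textbook dynamic program of Wagner and Fischer described above, the following Python function computes the edit distance column by column. The function name and the two cost parameters are assumptions of this example. It keeps only two columns of the matrix, which gives $O(m)$ space for the distance value alone; recovering an edit script within that space needs Hirschberg's divide and conquer technique, which is not shown.

```python
def edit_distance(p, q, sub_cost=1, indel_cost=1):
    """Edit distance between p and q with uniform (symmetric) costs."""
    if len(p) > len(q):
        p, q = q, p                      # make p the shorter string (m <= n)
    m = len(p)
    # prev[i] holds D[i][j-1]; initially the column for the empty prefix of q.
    prev = [i * indel_cost for i in range(m + 1)]
    for j, c in enumerate(q, start=1):
        cur = [j * indel_cost] + [0] * m
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i] + indel_cost,                            # insert c
                cur[i - 1] + indel_cost,                         # delete p[i-1]
                prev[i - 1] + (0 if p[i - 1] == c else sub_cost) # match/substitute
            )
        prev = cur
    return prev[m]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3 under unit costs.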

For the unit-cost string edit distance problem faster algorithms are known. Masek and Paterson [MP80]
showed how to encode and compactly represent small submatrices of the dynamic programming table. The
space needed for the encoded submatrices is $O(n)$ but the dynamic programming algorithm can now be
simulated in $O(mn/\log^2 n + m + n)$ time². The encoding and tabulating idea in this algorithm is often
referred to as the Four Russian technique after Arlazarov et al. [ADKF70] who introduced the idea for
boolean matrix multiplication. The algorithm of Masek and Paterson assumes a constant sized alphabet and
this restriction cannot be trivially removed. The details of their algorithm are given in Section 5.5.

Instead of encoding submatrices of the dynamic programming table using large tables, several algorithms
based on simulating the dynamic programming algorithm using the arithmetic and logical operations of
the word RAM have been suggested [BYG92, WM92b, Wri94, BYN96, Mye99, HN05]. We will refer to this
technique as word-level parallelism (see also Section 1.6.3). Myers [Mye99] gave a very practical
$O(nm/w + n + m\sigma)$ time and $O(m\sigma/w + n + m)$ space algorithm based on word-level parallelism. The algorithm can
be modified in a straightforward fashion to handle arbitrary alphabets in $O(nm/w + n + m\log m)$ time and
$O(m)$ space by using deterministic dictionaries [HMP01].
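To give a feel for word-level parallelism, here is a hedged Python sketch in the spirit of Myers' bit-vector algorithm, following a commonly used reformulation of its recurrences. The function and variable names are assumptions of this example. Python integers are unbounded, so a single integer plays the role of the $\lceil m/w \rceil$ machine words; the sketch shows the bit operations that update a whole column of the distance matrix at once, not the actual running time on a word RAM.

```python
def bitvector_edit_distance(p, q):
    """Unit-cost edit distance between p and q via bit-parallel column updates."""
    m = len(p)
    if m == 0:
        return len(q)
    mask = (1 << m) - 1           # keeps every vector to m bits
    high = 1 << (m - 1)           # bit of the last row
    peq = {}                      # peq[c] has bit i set iff p[i] == c
    for i, c in enumerate(p):
        peq[c] = peq.get(c, 0) | (1 << i)

    vp, vn = mask, 0              # vertical +1 / -1 deltas; first column is 0..m
    score = m                     # current value of D[m][j]
    for c in q:
        eq = peq.get(c, 0)
        d0 = ((((eq & vp) + vp) ^ vp) | eq | vn) & mask   # diagonal delta is 0
        hp = (vn | ~(d0 | vp)) & mask                     # horizontal delta +1
        hn = vp & d0                                      # horizontal delta -1
        if hp & high:
            score += 1
        elif hn & high:
            score -= 1
        hp = ((hp << 1) | 1) & mask   # row 0 grows by one per column
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return score
```

Each iteration of the loop uses a constant number of bitwise and arithmetic operations on m-bit vectors, which is the source of the $nm/w$ term in the bounds quoted above.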

A close relative of the string edit distance problem is the approximate string matching problem. Given 
strings P and Q and an error threshold k, the goal is to find all ending positions of substrings in Q whose 
unit-cost string edit distance to P is at most k. Sellers [Sel80] showed how a simple modification of the
dynamic programming algorithm for string edit distance can be used to solve approximate string matching.
Consequently, all of the bounds listed above for string edit distance also hold for approximate string matching. 
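The modification attributed to Sellers can be sketched as follows: the first row of the distance matrix is set to zero, so that a match may start at any position of Q, and every column whose last entry is at most k marks an ending position. The function name and the 1-based reporting convention are assumptions of this example.

```python
def approximate_matches(p, q, k):
    """Ending positions j (1-based) of substrings of q within
    unit-cost edit distance k of p."""
    m = len(p)
    prev = list(range(m + 1))            # column for the empty prefix of q
    ends = []
    for j, c in enumerate(q, start=1):
        cur = [0] * (m + 1)              # D[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i] + 1,
                cur[i - 1] + 1,
                prev[i - 1] + (0 if p[i - 1] == c else 1),
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends
```

Apart from the initialization of row 0 and the check of the last row, the computation is the same as for string edit distance, which is why the bounds carry over.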

For more variations of the string edit distance and approximate string matching problems and algorithms 
optimized for various properties of the input strings see, e.g., [Got82, Ukk85a, Mye86, MM88, EGG88, LV89,
EGGI92, LMS98, MNU05, ALP04, CM07, CH02]. For surveys see [Mye91, Nav01a, Gus97].

Our Results and Techniques In Section 5.5 we revisit the Four Russian algorithm of Masek and Paterson
[MP80] and the assumption that the alphabet size must be constant. We present an algorithm using
$O(nm \log\log n/\log^2 n + m + n)$ time and $O(n)$ space that works for any alphabet (Theorem 15). Thus,
we remove the alphabet assumption at the cost of a factor $\log\log n$ in the running time. Compared with
Myers' algorithm [Mye99] (modified to work for any alphabet) that uses $O(nm/w + n + m\log m)$ time our
algorithm is faster when $w = o(\log^2 n/\log\log n)$ (assuming that the first terms of the complexities dominate). Our
result immediately generalizes to approximate string matching.

The key idea to achieve our result is a more sophisticated encoding of submatrices of the distance matrix
that maps input characters corresponding to the submatrix into a small range of integers. However,
computing this encoding directly requires too much time for our result. Therefore we construct a two-level
decomposition of the distance matrix such that multiple submatrices can be efficiently encoded simultaneously.
Combined, these ideas lead to the stated result.

1.4.2 Regular Expression Matching 

Regular expressions are a simple and flexible way to recursively describe a set of strings composed from 
simple characters using union, concatenation, and Kleene star. Given a regular expression R and a string Q 
the regular expression matching problem is to decide if Q matches one of the strings denoted by R.

Regular expressions are frequently used in the lexical analysis phase of compilers to specify and distinguish
tokens to be passed to the syntax analysis phase. Standard programs such as Grep, the programming
languages Perl [Wal94] and Awk [AKW98], and most text editors, have mechanisms to deal with regular
expressions. Recently, regular expressions have also found applications in computational biology for protein
searching [NR03].

² Note that the result stated by the authors is a $\log n$ factor slower. This is because they assumed a computational model
where operations take time proportional to their length in bits. To be consistent we have restated their result in the uniform
cost model.

Before discussing the known complexity results for regular expression matching we briefly present some 
of the basic concepts. More details can be found in Aho et al. [ASU86] . 

The set of regular expressions over an alphabet $\Sigma$ is defined recursively as follows: A character $\alpha \in \Sigma$
is a regular expression, and if S and T are regular expressions then so is the concatenation, $(S) \cdot (T)$, the
union, $(S)|(T)$, and the star, $(S)^*$. The language L(R) generated by R is defined as follows: $L(\alpha) = \{\alpha\}$,
$L(S \cdot T) = L(S) \cdot L(T)$, that is, any string formed by the concatenation of a string in L(S) with a string in
L(T), $L(S|T) = L(S) \cup L(T)$, and $L(S^*) = \bigcup_{i \ge 0} L(S)^i$, where $L(S)^0 = \{\epsilon\}$ and $L(S)^i = L(S)^{i-1} \cdot L(S)$,
for $i > 0$. Given a regular expression R and a string Q the regular expression matching problem is to decide
if $Q \in L(R)$.

A finite automaton is a tuple $A = (V, E, \Sigma, \theta, \Phi)$, where V is a set of nodes called states, E is a set of
directed edges between states called transitions each labeled by a character from $\Sigma \cup \{\epsilon\}$, $\theta \in V$ is a start
state, and $\Phi \subseteq V$ is a set of final states. In short, A is an edge-labeled directed graph with a special start
node and a set of accepting nodes. A is a deterministic finite automaton (DFA) if A does not contain any
$\epsilon$-transitions, and all outgoing transitions of any state have different labels. Otherwise, A is a non-deterministic
automaton (NFA). We say that A accepts a string Q if there is a path from the start state to an accepting
state such that the concatenation of labels on the path spells out Q. Otherwise, A rejects Q.

Let R be a regular expression of length m and let Q be a string of length n. The classic solution to
regular expression matching is to first construct an NFA A accepting all strings in L(R). There are several
NFA constructions with this property [MY60, Glu61, Tho68]. Secondly, we simulate A on Q by producing a
sequence of state-sets $S_0, \ldots, S_n$ such that $S_i$ consists of all states in A for which there is a path from the
start state of A that spells out the ith prefix of Q. Finally, $S_n$ contains an accepting state of A if and only
if A accepts Q and hence we can determine if Q matches a string in L(R) by inspecting $S_n$.

Thompson [Tho68] gave a simple well-known NFA construction for regular expressions that we will call
a Thompson-NFA (TNFA). For R the TNFA A has at most 2m states and 4m transitions, a single accepting
state, and can be computed in $O(m)$ time. Each of the state-sets in the simulation of A on Q can be computed
in $O(m)$ time using a breadth-first search of A. This implies an algorithm for regular expression matching
using $O(nm)$ time. Each of the state-sets only depends on the previous one and therefore the space is $O(m)$.
The full details of Thompson's construction are given in Section 6.2.
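The state-set simulation described above can be sketched in Python as follows. The input format (a regular expression given as nested tuples such as `('cat', r1, r2)`, `('alt', r1, r2)`, `('star', r)` and `('char', 'a')`) and all function names are assumptions made for this illustration; the construction follows the usual Thompson-style rules but is not taken from Section 6.2. The closure step uses a plain graph search over the $\epsilon$-transitions.

```python
def build_tnfa(ast):
    """Thompson-style NFA for a regex given as a nested-tuple AST."""
    eps = []     # eps[v]: states reachable from v by one epsilon-transition
    char = {}    # char[v] = (label, target) for a labeled transition out of v

    def new_state():
        eps.append([])
        return len(eps) - 1

    def build(node):
        kind = node[0]
        if kind == 'cat':                # link the two fragments, no new states
            s1, t1 = build(node[1])
            s2, t2 = build(node[2])
            eps[t1].append(s2)
            return s1, t2
        s, t = new_state(), new_state()
        if kind == 'char':
            char[s] = (node[1], t)
        elif kind == 'alt':
            for sub in node[1:]:
                si, ti = build(sub)
                eps[s].append(si)
                eps[ti].append(t)
        elif kind == 'star':
            si, ti = build(node[1])
            eps[s] += [si, t]            # enter the loop or skip it
            eps[ti] += [si, t]           # repeat the loop or leave it
        else:
            raise ValueError(kind)
        return s, t

    start, accept = build(ast)
    return eps, char, start, accept


def closure(states, eps):
    """All states reachable from `states` using epsilon-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        v = stack.pop()
        for u in eps[v]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def matches(ast, q):
    """Decide whether q is in the language of the regex `ast`."""
    eps, char, start, accept = build_tnfa(ast)
    current = closure({start}, eps)                     # state-set S_0
    for c in q:
        moved = {char[v][1] for v in current
                 if v in char and char[v][0] == c}
        current = closure(moved, eps)                   # state-set S_i
    return accept in current
```

For instance, with `r = ('cat', ('star', ('alt', ('char', 'a'), ('char', 'b'))), ('char', 'c'))`, which stands for (a|b)*c, the call `matches(r, "abac")` returns `True` and `matches(r, "abca")` returns `False`. Only the current state-set is kept between steps, mirroring the $O(m)$ space bound.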

We note that the regular expression matching problem is sometimes defined as reporting all of the ending 
positions of substrings of Q matching R. Thompson's algorithm can easily be adapted without loss of 
efficiency for this problem. Simply add the start state to the current state-set before computing the next 
and inspect the accepting state of the state-sets at each step. All of the algorithms presented in this section 
can be adapted in a similar fashion such that the bounds listed below also hold for this variation. 

In practical implementations regular expression matching is often solved by converting the NFA accepting
the regular expression into a DFA before simulation. However, in worst-case the standard DFA-construction
needs $O(2^{2m} \lceil m/w \rceil \sigma)$ space. With a more succinct representation of the DFA the space can be reduced
to $O((2^m + \sigma) \lceil m/w \rceil)$ [NR04, WM92b]. Note that the space complexity is still exponential in the length of
the regular expression. Normally, it is reported that the time complexity for simulating the DFA is $O(n)$,
however, this analysis does not account for the limited word size of the word RAM. In particular, since there
are $2^m$ states in the DFA each state requires $O(m)$ bits to be addressed. Therefore we may need $\Omega(\lceil m/w \rceil)$
time to identify the next state and thus the total time to simulate the DFA becomes $\Omega(n \lceil m/w \rceil)$. This bound
is matched by Navarro and Raffinot [NR04] who showed how to solve the problem in $O((n + 2^m) \lceil m/w \rceil)$
time and $O((2^m + \sigma) \lceil m/w \rceil)$ space. Navarro and Raffinot [NR04] suggested using a table splitting technique
to improve the space complexity of the DFA algorithm for regular expression matching. For any s this
technique gives an algorithm using $O((n + 2^{m/s}) s \lceil m/w \rceil)$ time and $O((2^{m/s} s + \sigma) \lceil m/w \rceil)$ space.
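The conversion of an NFA into a DFA mentioned above is usually done with the subset construction. The following Python sketch illustrates it under assumptions of our own: the NFA is given as a dictionary `eps` of $\epsilon$-transitions and a dictionary `delta` keyed by (state, character), and all names are hypothetical. Only the subsets reachable from the start state are generated, yet their number can still be exponential in the number of NFA states.

```python
def nfa_to_dfa(eps, delta, start, accepting, alphabet):
    """Subset construction; each DFA state is a frozenset of NFA states."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            v = stack.pop()
            for u in eps.get(v, ()):
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        return frozenset(seen)

    d_start = closure({start})
    table = {}                        # (DFA state, char) -> DFA state
    seen, todo = {d_start}, [d_start]
    while todo:
        S = todo.pop()
        for c in alphabet:
            T = closure({u for v in S for u in delta.get((v, c), ())})
            table[S, c] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    d_accepting = {S for S in seen if S & set(accepting)}
    return d_start, table, d_accepting


def dfa_accepts(dfa, q):
    """Run the DFA on q; characters must come from the given alphabet."""
    state, table, accepting = dfa
    for c in q:
        state = table[state, c]       # one table lookup per character
    return state in accepting
```

As a usage example, the three-state NFA with `delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'a'): {2}, (1, 'b'): {2}}`, start state 0 and accepting state 2 accepts the strings over {a, b} whose second-to-last character is a; the construction yields a DFA with four states for it.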

The DFA-based algorithms are primarily interesting for sufficiently small regular expressions. For instance,
if $m = O(\log n)$ it follows that regular expression matching can be solved in $O(n)$ time and $O(n + \sigma)$
space. Several heuristics can be applied to further improve the DFA based algorithms, e.g., we often do
not need to fill in all entries in the table, the DFA can be stored in an adjacency-list representation and
minimized, etc. None of these improve the above worst-case complexities of the DFA based algorithms.

Myers [Mye92a] showed how to efficiently combine the benefits of NFAs with DFAs. The key idea in
Myers' algorithm is to decompose the TNFA built from R into $O(\lceil m/\log n \rceil)$ subautomata each consisting
of $O(\log n)$ states. Using the Four Russian technique [ADKF70] each subautomaton is converted into a DFA
using $O(2^m) = O(n)$ space giving a total space complexity of $O(nm/\log n)$. The subautomata can then be
simulated in constant time leading to an algorithm using $O(nm/\log n + (n + m)\log m)$ time. The details of
Myers' algorithm can be found in Sections 5.2 and 6.3.

For variants and extensions of the regular expression matching problem see [KM95b, MOG98, Yam01,
NR03, YM03, ISY03].

Our Results and Techniques In Section 5.2 we improve the space complexity of Myers' Four Russian
algorithm. We present an algorithm using $O(nm/\log n + n + m\log m)$ time and $O(n)$ space (Theorem 12).
Hence, we match or improve the running time of Myers' algorithm while we significantly improve the space
complexity from $O(nm/\log n)$ to $O(n)$.

As in Myers' algorithm, our new result is achieved using a decomposition of the TNFA into small
subautomata of $\Theta(\log n)$ states. To improve the space complexity we give a more efficient encoding. First, we
represent the labels of transitions in each subautomaton using deterministic dictionaries [HMP01]. Secondly,
we bound the number of distinct TNFAs without labels on transitions. Using this bound we show that it is
possible to encode all TNFAs with $x = O(\log n)$ states in total space $O(n)$, thereby obtaining our result.

Our space-efficient Four Russian algorithm for regular expression matching is faster than Thompson's
algorithm which uses $O(nm)$ time. However, to achieve the speedup we use $\Omega(n)$ space, which may still be
significantly larger than the $O(m)$ space used by Thompson's algorithm. In Chapter 6 we study a different and
more space-efficient approach to regular expression matching. Specifically, we show that regular expression
matching can be solved in $O(m)$ space with the following running times (Theorem 16):

$$
\begin{cases}
O\left(n \frac{m \log w}{w} + m \log w\right) & \text{if } m > w \\
O(n \log m + m \log m) & \text{if } \sqrt{w} < m \le w \\
O(\min(n + m^2,\, n \log m + m \log m)) & \text{if } m \le \sqrt{w}.
\end{cases}
$$

To compare these bounds with previous results, let us assume a conservative word length of $w = \log n$. When
the regular expression is "large", e.g., if $m > \log n$, we achieve an $O(\frac{\log n}{\log\log n})$ factor speedup over Thompson's
algorithm using $O(m)$ space. In this case we simultaneously match the best known time and space bounds
for the problem, with the exception of an $O(\log\log n)$ factor in time. Next, consider the case when the
regular expression is "small", e.g., $m = O(\log n)$. In this case, we get an algorithm using $O(n \log\log n)$ time
and $O(\log n)$ space. Hence, the space is improved exponentially at the cost of an $O(\log\log n)$ factor in time.
In the case of an even smaller regular expression, e.g., $m = O(\sqrt{\log n})$, the slowdown can be eliminated
and we achieve optimal $O(n)$ time. For larger word lengths, our time bounds improve. In particular, when
$w > \log n \log\log n$ the bound is better in all cases, except for $\sqrt{w} < m \le w$, and when $w > \log^2 n$ it improves
the time bound of Myers' algorithm.

As in Myers' and our previous algorithms for regular expression matching, this algorithm is based on
a decomposition of the TNFA. However, for this result, a slightly more general decomposition is needed
to handle different sizes of subautomata. We provide this by showing how any "black-box" algorithm
for simulating small TNFAs can be efficiently converted into an algorithm for simulating larger TNFAs (see
Section 6.3 and Lemma 38). To achieve $O(m)$ space we cannot afford to encode the subautomata as in the
Four Russian algorithms. Instead we present two algorithms that simulate the subautomata using word-level
parallelism. The main problem in doing so is the complicated dependencies among states in TNFAs.
A state may be connected via long paths of $\epsilon$-transitions to a number of other states, all of which have to be
traversed in parallel to simulate the TNFA. Our first algorithm, presented in Section 6.4, simulates TNFAs
with $O(\sqrt{w})$ states in constant time for each step. The main idea is to explicitly represent the transitive
closure of the $\epsilon$-paths compactly in a constant number of words. Combined with a number of simple word
operations we show how to compute the next state-set in constant time. Our second, more complicated
algorithm, presented in Section 6.5, simulates TNFAs with $O(w)$ states in $O(\log w)$ time for each step.
Instead of representing the transitive closure of the $\epsilon$-paths this algorithm recursively decomposes the
TNFA into $O(\log w)$ levels that represent increasingly smaller subautomata. Using this decomposition we
then show how to traverse all of the $\epsilon$-paths in constant time for each level. We combine the two algorithms
with our black-box simulation of large TNFAs, and choose the best algorithm in the various cases to get the
stated result.

1.4.3 Approximate Regular Expression Matching 

Given a regular expression R, a string Q, and an error threshold k the approximate regular expression 
matching problem is to determine if the minimum unit-cost string edit distance between Q and a string in 
L(R) is at most k. As in the above let m and n be the lengths of R and Q, respectively. 

Myers and Miller [MM89] introduced the problem and gave an $O(nm)$ time and $O(m)$ space algorithm.
Their algorithm is an extension of the standard dynamic programming algorithm for approximate string
matching adapted to handle regular expressions. Note that the time and space complexities are the same
as in the simple case of strings. Assuming a constant sized alphabet, Wu et al. [WMM95] proposed a Four
Russian algorithm using $O(\frac{mn\log(k+2)}{\log n} + n + m)$ time and $O(\frac{m\sqrt{n}\log(k+2)}{\log n} + n + m)$ space. This algorithm
combines decomposition of TNFAs into subautomata as in the earlier algorithm of Myers for regular expression
matching [Mye92a] and the dynamic programming idea of Myers and Miller [MM89] for approximate regular
expression matching. Recently, Navarro [Nav04] proposed a practical DFA based solution for small regular
expressions.

Variants of approximate regular expression matching including extensions to more complex cost functions 
can be found in [MM89, KM95c, Mye92b, MOG98, NR03]. 

Our Results and Techniques In Section 5.3 we present an algorithm for approximate regular expression
matching using $O(\frac{mn\log(k+2)}{\log n} + n + m\log m)$ time and $O(n)$ space that works for any alphabet (Theorem 13).
Hence, we match the running time of Wu et al. [WMM95] while improving the space complexity from
$O(\frac{m\sqrt{n}\log(k+2)}{\log n} + n + m)$ to $O(n)$.

We obtain the result as a simple combination and extension of the techniques used in our Four Russian
algorithm for regular expression matching and the algorithm of Wu et al. [WMM95].

1.4.4 Subsequence Indexing

Recall that a subsequence of a string Q is a string that can be obtained from Q by deleting zero or more
characters. The subsequence indexing problem is to preprocess a string Q into a data structure efficiently
supporting queries of the form: "Is P a subsequence of Q?" for any string P.

Baeza-Yates [BY91] introduced the problem and gave several algorithms. Let m and n denote the length
of P and Q, respectively, and let $\sigma$ be the size of the alphabet. Baeza-Yates showed that the subsequence indexing
problem can either be solved using $O(n\sigma)$ space and $O(m)$ query time, $O(n\log\sigma)$ space and $O(m\log\sigma)$ query
time, or $O(n)$ space and $O(m\log n)$ query time. For these algorithms the preprocessing time matches the
space bounds.

The key component in Baeza-Yates' solutions is a DFA called the directed acyclic subsequence graph
(DASG). Baeza-Yates obtains the first trade-off listed above by explicitly constructing the DASG and using
it to answer queries. The second trade-off follows from an encoded version of the DASG and the third
trade-off is based on simulating the DASG using predecessor data structures.

Several variants of subsequence indexing have been studied, see [DFG+97, BCGM99] and the surveys [Tro01,
CMT03].

Our Results and Techniques In Section 5.4 we improve the bounds for subsequence indexing. We show
how to solve the problem using $O(n\sigma^{1/2^l})$ space and preprocessing time and $O(m(l + 1))$ time for queries,
for $0 \le l \le \log\log\sigma$ (Theorem 14). In particular, for constant l we get a data structure using $O(n\sigma^{\epsilon})$ space
and preprocessing time and $O(m)$ query time and for $l = \log\log\sigma$ we get a data structure using $O(n)$ space
and preprocessing time and $O(m\log\log\sigma)$ for queries.

The key idea is a simple two-level decomposition of the DASG that efficiently combines the explicit DASG
with a fast predecessor structure. Using the classical van Emde Boas data structure [vEBKZ77] leads to
$O(n)$ space and preprocessing time with $O(m\log\log\sigma)$ query time. To get the full trade-off, we replace this
data structure with a recent one by Thorup [Tho03].
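To make the subsequence indexing problem concrete, here is a minimal Python sketch of the simplest linear-space approach: store, for every character, the sorted list of its positions in Q and answer a query with one binary search per character of P. This corresponds in spirit to the $O(n)$ space and $O(m\log n)$ query time trade-off, with a sorted array standing in for a predecessor data structure; it is not the two-level decomposition of Section 5.4, and the function names are assumptions of this example.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(q):
    """Map each character of q to the sorted list of its positions."""
    positions = defaultdict(list)
    for i, c in enumerate(q):
        positions[c].append(i)
    return positions


def is_subsequence(index, p):
    """Greedily embed p from left to right, one binary search per character."""
    pos = 0                              # smallest position still allowed in q
    for c in p:
        occ = index.get(c)
        if not occ:
            return False                 # c does not occur in q at all
        k = bisect_left(occ, pos)        # first occurrence of c at or after pos
        if k == len(occ):
            return False
        pos = occ[k] + 1
    return True
```

After `idx = build_index("abracadabra")`, the query `is_subsequence(idx, "rcb")` returns `True` while `is_subsequence(idx, "cc")` returns `False`. Replacing the binary search with a faster successor structure is what drives the improved trade-offs.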

1.5 Compressed String Matching 

Compressed string matching covers problems that involve searching for an (uncompressed) pattern in a 
compressed target text without decompressing it. The goal is to search more efficiently than the obvious 
approach of decompressing the target and then performing the matching. Modern text data bases, e.g., for 
biological data and World Wide Web data, are huge. To save time and space the data must be kept in 
compressed form while allowing searching. Therefore, efficient algorithms for compressed string matching 
are needed. 

Amir and Benson [AB92b, AB92a] initiated the study of compressed string matching. Subsequently,
several researchers have proposed algorithms for various types of string matching problems and compression
methods [AB92b, FT98, KTS+98, KNU03, Nav03, MUN03]. For instance, given a string Q of length u
compressed with the Ziv-Lempel-Welch scheme [Wel84] into a string of length n, Amir et al. [ABF96] gave an
algorithm for finding all exact occurrences of a pattern string of length m in $O(n + m^2)$ time and space.
Algorithms for fully compressed pattern matching, where both the pattern and the target are compressed,
have also been studied (see the survey by Rytter [Ryt99]).

In Chapter 7, we study approximate string matching and regular expression matching problems in the 
context of compressed texts. As in previous work on these problems [KNU03, Nav03] we focus on the 
popular ZL78 and ZLW compression schemes [ZL78, Wel84]. These compression schemes adaptively divide 
the input into substrings, called phrases, which can be compactly encoded using references to other phrases. 
During encoding and decoding with the ZL78/ZLW compression schemes the phrases are typically stored in a
dictionary trie for fast access. Details of Ziv-Lempel compression can be found in Section 7.2. 
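The phrase structure of ZL78 can be illustrated with a short Python sketch. The output format (a list of pairs consisting of a reference to an earlier phrase and one new character) and the representation of the dictionary trie as a hash map keyed by (parent phrase, character) are assumptions of this example and may differ from the encoding details given in Section 7.2.

```python
def zl78_compress(text):
    """Parse text into ZL78 phrases, returned as (phrase_number, char) pairs."""
    trie = {}        # (parent phrase number, char) -> phrase number; 0 is empty
    out = []
    node = 0         # trie node of the phrase matched so far
    for c in text:
        nxt = trie.get((node, c))
        if nxt is not None:
            node = nxt                       # keep extending a known phrase
        else:
            trie[node, c] = len(trie) + 1    # new phrase = known phrase + c
            out.append((node, c))
            node = 0
    if node:                                 # input ended inside a known phrase
        out.append((node, ''))
    return out


def zl78_decompress(pairs):
    """Rebuild the text by expanding each phrase from the one it references."""
    phrases = ['']
    for ref, c in pairs:
        phrases.append(phrases[ref] + c)
    return ''.join(phrases[1:])
```

For example, `zl78_compress("ananas")` yields `[(0, 'a'), (0, 'n'), (1, 'n'), (1, 's')]`, that is, the phrases a, n, an, as. Each new phrase adds one leaf to the trie, so the trie grows to n nodes for a compressed text of n phrases.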

1.5.1 Compressed Approximate String Matching 

Recall that given strings P and Q and an error threshold k, the approximate string matching problem is to
find all ending positions of substrings of Q whose unit-cost string edit distance to P is at most k. Let m
and u denote the length of P and Q, respectively. For our purposes we are particularly interested in the
fast algorithms for small values of k, namely, the $O(uk)$ time algorithm by Landau and Vishkin [LV89] and
the more recent $O(uk^4/m + u)$ time algorithm due to Cole and Hariharan [CH02] (we assume w.l.o.g. that
$k < m$). Both of these can be implemented in $O(m)$ space.

Kärkkäinen et al. [KNU03] initiated the study of compressed approximate string matching with the
ZL78/ZLW compression schemes. If n is the length of the compressed text, their algorithm achieves
$O(nmk + occ)$ time and $O(nmk)$ space, where occ is the number of occurrences of the pattern. For special cases and
restricted versions of compressed approximate string matching, other algorithms have been proposed [MKT+00,
NR98]. An experimental study of the problem and an optimized practical implementation can be found
in [NKT+01]. Crochemore et al. [CLZU03] gave an algorithm for the fully compressed version of the
problem. If $m'$ is the length of the compressed pattern their algorithm runs in $O(um' + nm)$ time and space.

Our Results and Techniques In Section 7.3 we show how to efficiently use algorithms for the uncompressed
approximate string matching problem to achieve a simple time-space trade-off. Specifically, let
$t(m, u, k)$ and $s(m, u, k)$ denote the time and space, respectively, needed by any algorithm to solve the
(uncompressed) approximate string matching problem with error threshold k for pattern and text of length m
and u, respectively. We show that if Q is compressed using ZL78 then given a parameter $\tau \ge 1$ we can
solve compressed approximate string matching in $O(n(\tau + m + t(m, 2m + 2k, k)) + occ)$ expected time and
$O(n/\tau + m + s(m, 2m + 2k, k) + occ)$ space (Theorem 17). The expectation is due to hashing and can be
removed at an additional $O(n)$ space cost. In this case the bound also holds for ZLW compressed strings. We
assume that the algorithm for the uncompressed problem produces the matches in sorted order (as is the
case for all algorithms that we are aware of). Otherwise, additional time for sorting must be included in the
bounds.

To compare our result with the algorithm of Kärkkäinen et al. [KNU03], plug in the Landau-Vishkin
algorithm and set $\tau = mk$. This gives an algorithm using $O(nmk + occ)$ time and $O(n/mk + m + occ)$
space. This matches the best known time bound while improving the space by a factor $\Theta(m^2k^2)$.
Alternatively, if we plug in the Cole-Hariharan algorithm and set $\tau = k^4 + m$ we get an algorithm using
$O(nk^4 + nm + occ)$ time and $O(n/(k^4 + m) + m + occ)$ space. Whenever $k = O(m^{1/4})$ this is $O(nm + occ)$
time and $O(n/m + m + occ)$ space.

The key idea for our result is a simple $o(n)$ space data structure for ZL78 compressed texts. This data
structure compactly represents a subset of the dynamic dictionary trie whose size depends on the parameter
$\tau$. Combined with the compressed text the data structure enables fast access to relevant parts of the trie,
thereby allowing algorithms to solve compressed string matching problems in $o(n)$ space. To the best of
our knowledge, all previous non-trivial compressed string matching algorithms for ZL78/ZLW compressed text,
with the exception of a very slow algorithm for exact string matching by Amir et al. [ABF96], explicitly
construct the trie and therefore use $\Omega(n)$ space.

Our bound depends on the special nature of the ZL78 compression scheme and does not in general hold for
ZLW compressed texts. However, whenever we use $\Omega(n)$ space in the trade-off we have sufficient space to
explicitly construct the trie and therefore do not need our $o(n)$ space data structure. In this case the bound
holds for ZLW compressed texts and hashing is not needed. Note that even with $O(n)$ space we significantly
improve the previous bounds.

1.5.2 Compressed Regular Expression Matching 

Let R be a regular expression and let Q be a string. Recall that deciding if $Q \in L(R)$ and finding all occurrences
of substrings of Q matching L(R) was the same problem for all of the finite automaton-based algorithms 
discussed in Section 1.4.2. In the compressed setting this is not the case since the complexities we obtain 
for the substring variant of the problem may be dominated by the number of reported occurrences. In this 
section we therefore define regular expression matching as follows: Given a regular expression R and a string 
Q, the regular expression matching problem is to find all ending positions of substrings in Q matching a 
string in L(R). 

The only solution to the compressed problem is due to Navarro [Nav03], who studied the problem on
ZL78/ZLW compressed strings. This algorithm depends on a complicated mix of Four Russian techniques and
word-level parallelism. As a similar improvement is straightforward to obtain for our algorithm we ignore
these factors in the bounds presented here. With this simplification Navarro's algorithm uses
$O(nm^2 + occ \cdot m\log m)$ time and $O(nm^2)$ space, where m and n are the lengths of the regular expression and the
compressed text, respectively.

Our Results and Techniques We show that if Q is compressed using ZL78 or ZLW then given a parameter
$\tau \ge 1$ we can solve compressed regular expression matching in $O(nm(m + \tau) + occ \cdot m\log m)$ time and
$O(nm^2/\tau + nm)$ space (Theorem 18). If we choose $\tau = m$ we obtain an algorithm using $O(nm^2 + occ \cdot m\log m)$
time and $O(nm)$ space. This matches the best known time bound while improving the space by a factor
$\Theta(m)$. With word-parallel techniques these bounds can be improved slightly. The full details are given in
Section 7.4.5.

As in the previous section we obtain this result by representing information at a subset of the nodes in
the dictionary trie depending on the parameter $\tau$. In this case the total space used is always $\Omega(n)$ and therefore
we have sufficient space to store the trie.

1.6 Core Techniques

In this section we identify the core techniques used in this dissertation.

1.6.1 Data Structures

The basic goal of data structures is to organize information compactly and support fast queries. Hence,
it is not surprising that using the proper data structures in the design of pattern matching algorithms is
important. A good example is the tree data structures used in our algorithms for the tree inclusion problem
(Chapter 3). Let T be a rooted and labeled tree. A node z is a common ancestor of nodes v and w if it is an
ancestor of both v and w. The nearest common ancestor of v and w is the common ancestor of v and w of
maximum depth. The nearest common ancestor problem is to preprocess T into a data structure supporting
nearest common ancestor queries. Several linear-space data structures for the nearest common ancestor
problem that support queries in constant time are known [HT84, BFC00, AGKR04]. The first ancestor of w
labeled $\alpha$ is the ancestor of w of maximum depth labeled $\alpha$. The tree color problem is to preprocess T into
a data structure supporting first label queries. This is a well-studied problem [Die89, MM96, FM96, AHR98].
In particular, Dietz [Die89] gave a linear space solution supporting queries in $O(\log\log n_T)$ time. We use
data structures for both the nearest common ancestor problem and the tree color problem extensively in
our algorithms for the tree inclusion problem. More precisely, let v be a node in P with children $v_1, \ldots, v_k$.
After computing which subtrees of T include each of the subtrees of P rooted at $v_1, \ldots, v_k$ we find
the subtrees of T that include the subtree of P rooted at v using a series of nearest common ancestor and
first label queries. Much of the design of our algorithms for tree inclusion was directly influenced by our
knowledge of these data structures.

We use dictionaries in many of our results to handle large alphabets efficiently. Given a subset S of
elements from a universe U a dictionary preprocesses S into a data structure supporting membership queries
of the form: "Is $x \in S$?" for any $x \in U$. The dictionary also supports retrieval of satellite data associated
with x. In many of our results we rely on a dictionary construction due to Hagerup et al. [HMP01]. They
show how to preprocess a subset S of n elements from the universe $U = \{0, 1\}^w$ in $O(n\log n)$ time into an
$O(n)$ space data structure supporting membership queries in constant time. The preprocessing makes heavy
use of weak non-uniformity to obtain an error correcting code. A suitable code can be computed in $O(w2^w)$
time and no better algorithm than brute force search is known. In nearly all of our algorithms that use this
dictionary data structure we only work with polynomial sized universes. In this case, the dictionary can be
constructed in the above stated bound without the need for weak non-uniformity. The only algorithms in
this dissertation that use larger universes are our algorithms for regular expression matching in Chapter 6.
Both of these algorithms construct a deterministic dictionary for m elements in $O(m\log m)$ time. However,
in the first algorithm (Section 6.4) we may replace the dictionary with another dictionary data structure
by Ružić [Ruz04] that runs in $O(m^{1+\epsilon})$ preprocessing time and does not use weak non-uniformity. Since
the total running time of the algorithm is $\Omega(m^2)$ this does not affect our result. Our second algorithm
(Section 6.5) uses $\Omega(\log m)$ time in each step of the simulation and therefore we may simply use a sorted
array and binary searches to perform the lookup.

A key component in our result for compressed string matching (Chapter 7) is an efficient dictionary 
for sets that dynamically change under insertions of elements. This is needed to maintain our sublinear 
space data structure for representing a subset of the trie while the trie is dynamically growing through 
additions of leaves (see Section 7.2.1). For this purpose we use the dynamic perfect hashing data structure 
by Dietzfelbinger et al. [DKM+94] that supports constant time membership queries and constant amortized 
expected time insertions and deletions. 

Finally, for the subsequence indexing problem, presented in Section 5.4, we use the van Emde Boas
predecessor data structure (vEB) [vEB77, vEBKZ77]. For $x$ integers in the range $[1, X]$ a vEB answers
queries in $O(\log \log X)$ time and combined with perfect hashing the space complexity is $O(x)$ [MN90]. To
get the full trade-off we replace the vEB with a more recent data structure by Thorup [Tho03, Thm. 2]. This
data structure supports successor queries of $x$ integers in the range $[1, X]$ using $O(xX^{1/2^l})$ preprocessing
time and space with query time $O(l + 1)$, for $0 \leq l \leq \log \log X$. Patrascu and Thorup [PT06] recently showed
that in linear space the time bounds for the van Emde Boas data structures are optimal. Since predecessor
searches are the computational bottleneck in our algorithms for subsequence queries we cannot hope to get
an $O(n)$ space data structure with $O(m)$ query time using the techniques presented in Section 5.4.

1.6.2 Tree Techniques 

Several combinatorial properties of trees are used extensively throughout the dissertation. The simplest 
one is the heavy-path decomposition [HT84]. The technique partitions a tree into disjoint heavy-paths, such 
that at most a logarithmic number of distinct heavy-paths are encountered on any root-to-leaf path (see 
Section 4.4.1 for more details). The heavy-path decomposition is used in Klein's algorithm [Kle98] (presented
in Section 2.3.2.3) to achieve a worst-case efficient algorithm for tree edit distance. To improve the space of the
constrained tree edit distance problem and tree alignment Wang and Zhang [WZ05] order the computation
of children of nodes according to a heavy-path decomposition. In our worst-case algorithm for the tree path
subsequence problem (Section 4.4) we traverse the target tree according to a heavy-path traversal to reduce
the space of an algorithm from $O(n_P n_T)$ to $O(n_P \log n_T)$.

Various forms of grouping or clustering of nodes in trees are used extensively. Often the relationship
between the clusters is represented as another tree called a macro tree. In particular, in our third algorithm
for the tree inclusion problem (Section 3.5) we cluster the target tree into small logarithmic sized subtrees
overlapping in at most two nodes. A macro tree is used to represent the overlap between the clusters and
internal properties of the clusters. This type of clustering is well-known from several tree data structures,
see e.g., [AHT00, AHdLT97, Fre97], and the macro-tree representation is inspired by a related construction
of Alstrup and Rauhe [AR02].

In our worst-case algorithm for tree path subsequence (Section 4.4), a simpler tree clustering due to 
Gabow and Tarjan [GT83] is used. Here we cluster the pattern tree into logarithmic sized subtrees that 
may overlap only in their roots. We also construct a macro tree from these overlaps. Note that to achieve 
our worst-case bound for the tree path subsequence we are both clustering the pattern tree and using a 
heavy-path decomposition of the target tree. 

For the regular expression matching and approximate regular expression matching problems we cluster TNFAs into
small subautomata of varying sizes (see Sections 5.2.3 and 6.3.1). This clustering is based on a clustering
of the parse tree of the regular expression and is similar to the one by Gabow and Tarjan [GT83]. Our
second word-level parallel algorithm for regular expression matching (Section 6.5) uses a recursive form of
this clustering on subautomata of TNFAs to efficiently traverse paths of $\epsilon$-transitions in parallel.

For the subsequence indexing problem we cluster the DASG according to the size of the alphabet. The 
clusters are represented in a macro DASG. 

Finally, for compressed string matching we show how to efficiently select a small subset of nodes in the
dynamic dictionary trie such that the minimum distance from any node to a selected node is bounded by a given
parameter.

1.6.3 Word-RAM Techniques 

The Four Russian technique [ADKF70] is used in several algorithms to achieve speedup. The basic idea 
is to tabulate and encode solutions to all inputs of small subproblems, and use this to achieve a speedup. 
Combined with tree clustering we use the Four Russian technique in our worst-case efficient algorithms 
for tree inclusion and tree path subsequence (Sections 3.5 and 4.4) to achieve logarithmic speedups. Our 
results for string edit distance, regular expression matching, and approximate regular expression matching 
are improvements of previously known Four Russian techniques for these problems (see Sections 5.5, 5.2, 
and 5.3). 

Four Russian techniques have been widely used. For instance, many of the recent subcubic algorithms 
for the all-pairs-shortest-path problem make heavy use of this technique [Tak04,Zwi04,Han04,Cha06,Han06, 
Cha07]. 

Our latest results for regular expression matching (Chapter 6) do not use the Four Russian technique.
Instead of simulating the automata using table-lookups we simulate them using the instruction set of the
word RAM. This kind of technique is often called word-level parallelism. Compared to our Four Russian
algorithm this is more space-efficient since the large tables are avoided. Furthermore, the speedup depends on
the word length rather than the available space for tables and therefore our algorithm can take advantage
of machines with long word length.

Word-level parallelism has been used in many areas of algorithms, for instance in the fast algorithms
for sorting integers [vEB77, FW93, AH97, AHNR98, HT02]. Within the area of string matching many of the
fastest practical algorithms are based on word-level parallel techniques, see e.g., [BYG92, Mye99, Nav01a].
In string matching, the term bit-parallelism, introduced by Baeza-Yates [BY89], is often used instead of the
term word-level parallelism.
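
To give a flavor of bit-parallelism, the following is a minimal sketch of the classic Shift-And approach to exact string matching, in the spirit of [BYG92]. It is not one of the algorithms of this dissertation. The function name `shift_and_search` is our own, and Python's unbounded integers stand in for machine words, so the sketch ignores the restriction that the pattern must fit in a constant number of words.

```python
def shift_and_search(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # bit i set iff pattern[0..i] matches a suffix of the text read so far
    hits = []
    for j, c in enumerate(text):
        # One shift, one OR, and one AND update all m prefix states at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

Each text character is processed with a constant number of word operations, which is where the speedup over simulating the $m$ states one at a time comes from.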

1.7 Discussion 

I will conclude this introduction by discussing which of the contributions in this dissertation I find the most 
interesting. 

First, I want to mention our results for the tree inclusion problem. As the volume of tree structured 
data is growing rapidly in areas such as biology and image analysis I believe algorithms for querying of trees 
will become very important in the near future. Our work shows how to obtain a fast and space-efficient 
algorithm for a very simple tree query problem, but the ideas may be useful to obtain improved results for 
more sophisticated tree query problems. 

Secondly, I want to mention our results for regular expression matching using word-level parallelism.
Modern computers have large word lengths and support a sophisticated set of instructions, see e.g., [PWW97, 
TONH96,TH99,OFW99,DDHS00]. Taking advantage of such features is a major challenge for the string 
matching community. Some of the steps used in our regular expression matching algorithms resemble some 
of these sophisticated instructions, and therefore it is likely possible to implement a fast practical version 
of the algorithm. We believe that some of the ideas may also be useful to improve other string matching 
problems. 

Finally, I want to mention our results for compressed string matching. Almost all of the available 
algorithms for compressed string matching problems require space at least linear in the size of the compressed 
text. Since space is a likely bottleneck in practical situations more space-efficient algorithms are needed. Our 
work solves approximate string matching efficiently using sublinear space, and we believe that the techniques 
may be useful in other compressed string matching problems. 
Abstract

We survey the problem of comparing labeled trees based on simple local operations of deleting, 
inserting, and relabeling nodes. These operations lead to the tree edit distance, alignment distance, and 
inclusion problem. For each problem we review the results available and present, in detail, one or more 
of the central algorithms for solving the problem. 

2.1 Introduction 

Trees are among the most common and well-studied combinatorial structures in computer science. In 
particular, the problem of comparing trees occurs in several diverse areas such as computational biology,
structured text databases, image analysis, automatic theorem proving, and compiler optimization
[Tai79, ZS89, KM95a, KTSK00, HO82, RR92, ZSW94]. For example, in computational biology, computing
the similarity between trees under various distance measures is used in the comparison of RNA secondary 
structures [ZS89, JWZ95]. 

Let $T$ be a rooted tree. We call $T$ a labeled tree if each node is assigned a symbol from a fixed finite
alphabet $\Sigma$. We call $T$ an ordered tree if a left-to-right order among siblings in $T$ is given. In this paper we
consider matching problems based on simple primitive operations applied to labeled trees. If T is an ordered 
tree these operations are defined as follows: 

relabel Change the label of a node v in T. 

delete Delete a non-root node $v$ in $T$ with parent $v'$, making the children of $v$ become the children of $v'$.
The children are inserted in the place of $v$ as a subsequence in the left-to-right order of the children of
$v'$.

insert The complement of delete. Insert a node $v$ as a child of $v'$ in $T$ making $v$ the parent of a consecutive
subsequence of the children of $v'$.
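
The three operations on ordered trees can be sketched as follows. This is an illustration only: the `Node` class, the function names, and the convention of addressing a child by its index in the parent's child list are our own assumptions.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])


def relabel(v, new_label):
    """Change the label of node v."""
    v.label = new_label


def delete(parent, i):
    """Delete the i-th child v of parent; the children of v take the place of v."""
    v = parent.children[i]
    parent.children[i:i + 1] = v.children
    return v


def insert(parent, label, s, t):
    """Insert a new child of parent that adopts the consecutive children s..t-1."""
    v = Node(label, parent.children[s:t])
    parent.children[s:t] = [v]
    return v


def to_tuple(v):
    """Nested (label, children) form, convenient for comparing trees."""
    return (v.label, [to_tuple(c) for c in v.children])
```

Deleting a node and then inserting a node with the same label over the same range of children restores the original tree, which reflects that insert is the complement of delete.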

Figure 2.1 illustrates the operations. For unordered trees the operations can be defined similarly. In this 
case, the insert and delete operations work on a subset instead of a subsequence. We define three problems
based on the edit operations. Let $T_1$ and $T_2$ be labeled trees (ordered or unordered).

Tree edit distance Assume that we are given a cost function defined on each edit operation. An edit
script $S$ between $T_1$ and $T_2$ is a sequence of edit operations turning $T_1$ into $T_2$. The cost of $S$ is the sum of
the costs of the operations in $S$. An optimal edit script between $T_1$ and $T_2$ is an edit script between $T_1$ and
$T_2$ of minimum cost and this cost is the tree edit distance. The tree edit distance problem is to compute the
edit distance and a corresponding edit script.
Figure 2.1: (a) A relabeling of the node label $l_1$ to $l_2$. (b) Deleting the node labeled $l_2$. (c) Inserting a node
labeled $l_2$ as the child of the node labeled $l_1$.



Tree alignment distance Assume that we are given a cost function defined on pairs of labels. An alignment
$A$ of $T_1$ and $T_2$ is obtained as follows. First we insert nodes labeled with spaces into $T_1$ and $T_2$ so that they
become isomorphic when labels are ignored. The resulting trees are then overlaid on top of each other
giving the alignment $A$, which is a tree where each node is labeled by a pair of labels. The cost of $A$ is
the sum of costs of all pairs of opposing labels in $A$. An optimal alignment of $T_1$ and $T_2$ is an alignment of
minimum cost and this cost is called the alignment distance of $T_1$ and $T_2$. The alignment distance problem
is to compute the alignment distance and a corresponding alignment.

Tree inclusion $T_1$ is included in $T_2$ if and only if $T_1$ can be obtained by deleting nodes from $T_2$. The tree
inclusion problem is to determine if $T_1$ is included in $T_2$.

In this paper we survey each of these problems and discuss the results obtained for them. For reference,
Table 2.1 summarizes most of the available results. All of these and a few others are covered
in the text. The tree edit distance problem is the most general of the problems. The alignment distance
corresponds to a kind of restricted edit distance, while tree inclusion is a special case of both the edit and
alignment distance problems. Apart from these simple relationships, interesting variations on the edit distance
problem have been studied, leading to a more complex picture.

Both the ordered and unordered version of the problems are reviewed. For the unordered case, it turns 
out that all of the problems in general are NP-hard. Indeed, the tree edit distance and alignment distance 
problems are even MAX SNP-hard [ALM+98]. However, under various interesting restrictions, or for special
cases, polynomial time algorithms are available. For instance, if we impose a structure preserving restriction
on the unordered tree edit distance problem, such that disjoint subtrees are mapped to disjoint subtrees,
it can be solved in polynomial time. Also, unordered alignment for constant degree trees can be solved
efficiently.

Table 2.1: Results for the tree edit distance, alignment distance, and inclusion problem listed according to variant. $D_i$, $L_i$, and $I_i$ denote the
depth, the number of leaves, and the maximum degree respectively of $T_i$, $i = 1, 2$. The type is either O for ordered or U for unordered. The
value $u$ is the unit cost edit distance between $T_1$ and $T_2$ and the value $s$ is the number of spaces in the optimal alignment of $T_1$ and $T_2$. The
value $\Sigma_{T_1}$ is the set of labels used in $T_1$ and $m_{T_1,T_2}$ is the number of pairs of nodes in $T_1$ and $T_2$ which have the same label.

For the ordered version of the problems polynomial time algorithms exist. These are all based on the
classic technique of dynamic programming (see, e.g., [CLRS01, Chapter 15]) and most of them are simple
combinatorial algorithms. Recently however, more advanced techniques such as fast matrix multiplication
have been applied to the tree edit distance problem [Che01].

The survey covers the problems in the following way. For each problem and variations of it we review 
results for both the ordered and unordered version. This will in most cases include a formal definition of 
the problem, a comparison of the available results and a description of the techniques used to obtain the 
results. More importantly, we will also pick one or more of the central algorithms for each of the problems 
and present it in almost full detail. Specifically, we will describe the algorithm, prove that it is correct, 
and analyze its time complexity. For brevity, we will omit the proofs of a few lemmas and skip over some 
less important details. Common for the algorithms presented in detail is that, in most cases, they are the 
basis for more advanced algorithms. Typically, most of the algorithms for one of the above problems are 
refinements of the same dynamic programming algorithm. 

The main technical contribution of this survey is to present the problems and algorithms in a common 
framework. Hopefully, this will enable the reader to gain a better overview and deeper understanding 
of the problems and how they relate to each other. In the literature, there are some discrepancies in 
the presentations of the problems. For instance, the ordered edit distance problem was considered by 
Klein [Kle98] who used edit operations on edges. He presented an algorithm using a reduction to a problem 
defined on balanced parenthesis strings. In contrast, Zhang and Shasha [ZS89] gave an algorithm based on 
the postorder numbering on trees. In fact, these algorithms share many features which become apparent if 
considered in the right setting. In this paper we present these algorithms in a new framework bridging the 
gap between the two descriptions. 

Another problem in the literature is the lack of an agreement on a definition of the edit distance problem. 
The definition given here is by far the most studied and in our opinion the most natural. However, several 
alternatives ending in very different distance measures have been considered [Lu79, TT88, Sel77, Lu84]. In 
this paper we review these other variants and compare them to our definition. We should note that the edit 
distance problem defined here is sometimes referred to as the tree-to-tree correction problem. 

This survey adopts a theoretical point of view. However, the problems above are not only interesting 
mathematical problems but they also occur in many practical situations and it is important to develop 
algorithms that perform well on real-life problems. For practical issues see, e.g., [WZJS94,TSKK98,SWSZ02]. 

We restrict our attention to sequential algorithms. However, there has been some research in parallel 
algorithms for the edit distance problem, e.g., [ZS89, Zha96b, SZ90]. 

This summarizes the contents of this paper. Due to the fundamental nature of comparing trees and its 
many applications several other ways to compare trees have been devised. In this paper, we have chosen to 
limit ourselves to a handful of problems which we describe in detail. Other problems include tree pattern
matching [Kos89, DGM90] and [HO82, RR92, ZSW94], maximum agreement subtree [KA94, FT94], largest
common subtree [AH94, KMY95], and smallest common supertree [NRT00, GN98].

2.1.1 Outline 

In Section 2.2 we give some preliminaries. In Sections 2.3, 2.4, and 2.5 we survey the tree edit distance, 
alignment distance, and inclusion problems respectively. We conclude in Section 2.6 with some open problems.

2.2 Preliminaries and Notation 

In this section we introduce the notation and definitions we will use throughout the paper. For a graph $G$ we
denote the set of nodes and edges by $V(G)$ and $E(G)$ respectively. Let $T$ be a rooted tree. The root of $T$
is denoted by $\mathrm{root}(T)$. The size of $T$, denoted by $|T|$, is $|V(T)|$. The depth of a node $v \in V(T)$, $\mathrm{depth}(v)$,
is the number of edges on the path from $v$ to $\mathrm{root}(T)$. The in-degree of a node $v$, $\deg(v)$, is the number of
children of $v$. We extend these definitions such that $\mathrm{depth}(T)$ and $\deg(T)$ denote the maximum depth and
degree respectively of any node in $T$. A node with no children is a leaf and otherwise an internal node. The
number of leaves of $T$ is denoted by $\mathrm{leaves}(T)$. We denote the parent of node $v$ by $\mathrm{parent}(v)$. Two nodes
are siblings if they have the same parent. For two trees $T_1$ and $T_2$, we will frequently refer to $\mathrm{leaves}(T_i)$,
$\mathrm{depth}(T_i)$, and $\deg(T_i)$ by $L_i$, $D_i$, and $I_i$, $i = 1, 2$.

Let $\theta$ denote the empty tree and let $T(v)$ denote the subtree of $T$ rooted at a node $v \in V(T)$. If
$w \in V(T(v))$ then $v$ is an ancestor of $w$, and if $w \in V(T(v)) \setminus \{v\}$ then $v$ is a proper ancestor of $w$. If $v$ is
a (proper) ancestor of $w$ then $w$ is a (proper) descendant of $v$. A tree $T$ is ordered if a left-to-right order
among the siblings is given. For an ordered tree $T$ with root $v$ and children $v_1, \ldots, v_i$, the preorder traversal
of $T(v)$ is obtained by visiting $v$ and then recursively visiting $T(v_k)$, $1 \leq k \leq i$, in order. Similarly, the
postorder traversal is obtained by first visiting $T(v_k)$, $1 \leq k \leq i$, and then $v$. The preorder number and
postorder number of a node $w \in T(v)$, denoted by $\mathrm{pre}(w)$ and $\mathrm{post}(w)$, are the number of nodes preceding $w$
in the preorder and postorder traversal of $T$ respectively. The nodes to the left of $w$ in $T$ are the nodes
$u \in V(T)$ such that $\mathrm{pre}(u) < \mathrm{pre}(w)$ and $\mathrm{post}(u) < \mathrm{post}(w)$. If $u$ is to the left of $w$ then $w$ is to the right of
$u$.
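
The numbering above can be sketched in a few lines. The representation is an assumption made for illustration: a tree is given as a dictionary `children` mapping a node to the ordered list of its children (leaves may be omitted), and the function names are hypothetical. The recursive traversal is only suitable for trees of moderate depth.

```python
def number_nodes(children, root):
    """Return (pre, post): the number of nodes preceding each node in each traversal."""
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre)            # nodes visited before v in preorder
        for c in children.get(v, []):
            visit(c)
        post[v] = len(post)          # nodes visited before v in postorder

    visit(root)
    return pre, post


def is_ancestor(pre, post, v, w):
    """True iff v is an ancestor of w (every node is an ancestor of itself)."""
    return pre[v] <= pre[w] and post[w] <= post[v]


def is_left_of(pre, post, u, w):
    """True iff u is to the left of w."""
    return pre[u] < pre[w] and post[u] < post[w]
```

With both numberings available, the ancestor and left-of relations are decided in constant time per query.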

A forest is a set of trees. A forest $F$ is ordered if a left-to-right order among the trees is given and each
tree is ordered. Let $T$ be an ordered tree and let $v \in V(T)$. If $v$ has children $v_1, \ldots, v_i$ define $F(v_s, v_t)$,
where $1 \leq s \leq t \leq i$, as the forest $T(v_s), \ldots, T(v_t)$. For convenience, we set $F(v) = F(v_1, v_i)$.

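We assume throughout the paper that labels assigned to nodes are chosen from a finite alphabet $\Sigma$.
Let $\lambda \notin \Sigma$ denote a special blank symbol and define $\Sigma_\lambda = \Sigma \cup \{\lambda\}$. We often define a cost function,
$\gamma : (\Sigma_\lambda \times \Sigma_\lambda) \setminus (\lambda, \lambda) \to \mathbb{R}$, on pairs of labels. We will always assume that $\gamma$ is a distance metric. That is, for
any $l_1, l_2, l_3 \in \Sigma_\lambda$ the following conditions are satisfied:

1. $\gamma(l_1, l_2) \geq 0$, $\gamma(l_1, l_1) = 0$.

2. $\gamma(l_1, l_2) = \gamma(l_2, l_1)$.

3. $\gamma(l_1, l_3) \leq \gamma(l_1, l_2) + \gamma(l_2, l_3)$.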
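
As a small sanity check, the three conditions can be verified by brute force for a cost function over a small alphabet. The sketch below is illustrative only: `unit_cost`, `is_metric`, and the use of `None` for the blank symbol are our own assumptions. For simplicity it also evaluates the pair of two blanks, which the definition above excludes.

```python
from itertools import product

LAMBDA = None  # stand-in for the blank symbol


def unit_cost(a, b):
    """Cost 0 for equal labels and 1 otherwise."""
    return 0 if a == b else 1


def is_metric(cost, alphabet):
    """Check the three metric conditions on all triples of labels, blank included."""
    symbols = list(alphabet) + [LAMBDA]
    for a, b, c in product(symbols, repeat=3):
        if cost(a, b) < 0 or cost(a, a) != 0:
            return False                       # condition 1
        if cost(a, b) != cost(b, a):
            return False                       # condition 2
        if cost(a, c) > cost(a, b) + cost(b, c):
            return False                       # condition 3
    return True
```

The unit cost function passes the check, whereas a function that charges 3 for relabeling between two particular labels and 1 for everything else violates the triangle inequality.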

2.3 Tree Edit Distance 

In this section we survey the tree edit distance problem. Assume that we are given a cost function defined 
on each edit operation. An edit script $S$ between two trees $T_1$ and $T_2$ is a sequence of edit operations turning
$T_1$ into $T_2$. The cost of $S$ is the sum of the costs of the operations in $S$. An optimal edit script between
$T_1$ and $T_2$ is an edit script between $T_1$ and $T_2$ of minimum cost. This cost is called the tree edit distance,
denoted by $\delta(T_1, T_2)$. An example of an edit script is shown in Figure 2.2.

The rest of the section is organized as follows. First, in Section 2.3.1, we present some preliminaries and 
formally define the problem. In Section 2.3.2 we survey the results obtained for the ordered edit distance 
problem and present two of the currently best algorithms for the problem. The unordered version of the 
problem is reviewed in Section 2.3.3. In Section 2.3.4 we review results on the edit distance problem when 
various structure-preserving constraints are imposed. Finally, in Section 2.3.5 we consider some other variants 
of the problem. 

2.3.1 Edit Operations and Edit Mappings 

Let $T_1$ and $T_2$ be labeled trees. Following [Tai79] we represent each edit operation by $(l_1 \to l_2)$, where
$(l_1, l_2) \in (\Sigma_\lambda \times \Sigma_\lambda) \setminus (\lambda, \lambda)$. The operation is a relabeling if $l_1 \neq \lambda$ and $l_2 \neq \lambda$, a deletion if $l_2 = \lambda$, and
an insertion if $l_1 = \lambda$. We extend the notation such that $(v \to w)$ for nodes $v$ and $w$ denotes
$(\mathrm{label}(v) \to \mathrm{label}(w))$. Here, as with the labels, $v$ or $w$ may be $\lambda$. Given a metric cost function $\gamma$ defined on pairs
of labels we define the cost of an edit operation by setting $\gamma(l_1 \to l_2) = \gamma(l_1, l_2)$. The cost of a sequence
$S = s_1, \ldots, s_k$ of operations is given by $\gamma(S) = \sum_{i=1}^{k} \gamma(s_i)$. The edit distance, $\delta(T_1, T_2)$, between $T_1$ and $T_2$
is formally defined as:

$$\delta(T_1, T_2) = \min\{\gamma(S) \mid S \text{ is a sequence of operations transforming } T_1 \text{ into } T_2\}.$$

Since $\gamma$ is a distance metric $\delta$ becomes a distance metric too.

Figure 2.3: The mapping corresponding to the edit script in Figure 2.2.

An edit distance mapping (or just a mapping) between $T_1$ and $T_2$ is a representation of the edit operations,
which is used in many of the algorithms for the tree edit distance problem. Formally, define the triple
$(M, T_1, T_2)$ to be an ordered edit distance mapping from $T_1$ to $T_2$, if $M \subseteq V(T_1) \times V(T_2)$ and for any pair
$(v_1, w_1), (v_2, w_2) \in M$:

1. $v_1 = v_2$ iff $w_1 = w_2$. (one-to-one condition)

2. $v_1$ is an ancestor of $v_2$ iff $w_1$ is an ancestor of $w_2$. (ancestor condition)

3. $v_1$ is to the left of $v_2$ iff $w_1$ is to the left of $w_2$. (sibling condition)
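
The three conditions can be checked directly from preorder and postorder numbers. The following sketch is an illustration under our own assumptions: each tree is a pair `(children, root)` where `children` maps a node to its ordered child list, the mapping is a set of node pairs, and the function names are hypothetical. It tests all pairs of lines and therefore uses time quadratic in the size of the mapping.

```python
def number_nodes(children, root):
    """Preorder and postorder numbers of all nodes (recursive, for small trees)."""
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre)
        for c in children.get(v, []):
            visit(c)
        post[v] = len(post)

    visit(root)
    return pre, post


def is_ordered_mapping(M, tree1, tree2):
    """Check the one-to-one, ancestor, and sibling conditions for every pair in M."""
    pre1, post1 = number_nodes(*tree1)
    pre2, post2 = number_nodes(*tree2)

    def anc(pre, post, a, b):    # a is an ancestor of b
        return pre[a] <= pre[b] and post[b] <= post[a]

    def left(pre, post, a, b):   # a is to the left of b
        return pre[a] < pre[b] and post[a] < post[b]

    for v1, w1 in M:
        for v2, w2 in M:
            if (v1 == v2) != (w1 == w2):
                return False     # one-to-one condition
            if anc(pre1, post1, v1, v2) != anc(pre2, post2, w1, w2):
                return False     # ancestor condition
            if left(pre1, post1, v1, v2) != left(pre2, post2, w1, w2):
                return False     # sibling condition
    return True
```

Dropping the third test yields the corresponding check for unordered mappings.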

Figure 2.3 illustrates a mapping that corresponds to the edit script in Figure 2.2. We define the unordered
edit distance mapping between two unordered trees as the same, but without the sibling condition. We will
use $M$ instead of $(M, T_1, T_2)$ when there is no confusion. Let $(M, T_1, T_2)$ be a mapping. We say that a node
$v$ in $T_1$ or $T_2$ is touched by a line in $M$ if $v$ occurs in some pair in $M$. Let $N_1$ and $N_2$ be the set of nodes in
$T_1$ and $T_2$ respectively not touched by any line in $M$. The cost of $M$ is given by:

$$\gamma(M) = \sum_{(v,w) \in M} \gamma(v \to w) + \sum_{v \in N_1} \gamma(v \to \lambda) + \sum_{w \in N_2} \gamma(\lambda \to w)$$

Mappings can be composed. Let $T_1$, $T_2$, and $T_3$ be labeled trees. Let $M_1$ and $M_2$ be a mapping from $T_1$ to
$T_2$ and $T_2$ to $T_3$ respectively. Define

$$M_1 \circ M_2 = \{(v, w) \mid \exists u \in V(T_2) \text{ such that } (v, u) \in M_1 \text{ and } (u, w) \in M_2\}$$

With this definition it follows easily that $M_1 \circ M_2$ itself becomes a mapping from $T_1$ to $T_3$. Since $\gamma$ is a
metric, it is not hard to show that a minimum cost mapping is equivalent to the edit distance:

$$\delta(T_1, T_2) = \min\{\gamma(M) \mid (M, T_1, T_2) \text{ is an edit distance mapping}\}.$$

Hence, to compute the edit distance we can compute the minimum cost mapping. We extend the definition 
of edit distance to forests. That is, for two forests $F_1$ and $F_2$, $\delta(F_1, F_2)$ denotes the edit distance between
$F_1$ and $F_2$. The operations are defined as in the case of trees, however, roots of the trees in the forest may
now be deleted and trees can be merged by inserting a new root. The definition of a mapping is extended 
in the same way. 



2.3.2 General Ordered Edit Distance 

The ordered edit distance problem was introduced by Tai [Tai79] as a generalization of the well-known string
edit distance problem [WF74]. Tai presented an algorithm for the ordered version using $O(|T_1||T_2||L_1|^2|L_2|^2)$
time and space. Subsequently, Zhang and Shasha [ZS89] gave a simple algorithm improving the bounds
to $O(|T_1||T_2| \min(L_1, D_1) \min(L_2, D_2))$ time and $O(|T_1||T_2|)$ space. This algorithm was modified by Klein
[Kle98] to get a better worst case time bound of $O(|T_1|^2|T_2| \log|T_2|)$$^{1}$ under the same space bounds. We
present the latter two algorithms in detail below. Recently, Chen [Che01] has presented an algorithm using
$O(|T_1||T_2| + L_1^2|T_2| + L_1^{2.5}L_2)$ time and $O((|T_1| + L_1^2) \min(L_2, D_2) + |T_2|)$ space. Hence, for certain kinds of
trees the algorithm improves the previous bounds. This algorithm is more complex than all of the above
and uses results on fast matrix multiplication. Note that in the above bounds we can exchange $T_1$ with $T_2$
since the distance is symmetric.



2.3.2.1 A Simple Algorithm 

We first present a simple recursion which will form the basis for the two dynamic programming algorithms we 
present in the next two sections. We will only show how to compute the edit distance. The corresponding edit 
script can be easily obtained within the same time and space bounds. The algorithm is due to Klein [Kle98]. 
However, we should note that the presentation given here is somewhat different. We believe that our
framework is simpler and provides a better connection to previous work.

Let $F$ be a forest and $v$ be a node in $F$. We denote by $F - v$ the forest obtained by deleting $v$ from $F$.
Furthermore, define $F - T(v)$ as the forest obtained by deleting $v$ and all descendants of $v$. The following
lemma provides a way to compute edit distances for the general case of forests.

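Lemma 1 Let $F_1$ and $F_2$ be ordered forests and $\gamma$ be a metric cost function defined on labels. Let $v$ and $w$
be the rightmost (if any) roots of the trees in $F_1$ and $F_2$ respectively. We have,

$$
\begin{aligned}
\delta(\theta, \theta) &= 0 \\
\delta(F_1, \theta) &= \delta(F_1 - v, \theta) + \gamma(v \to \lambda) \\
\delta(\theta, F_2) &= \delta(\theta, F_2 - w) + \gamma(\lambda \to w) \\
\delta(F_1, F_2) &= \min
\begin{cases}
\delta(F_1 - v, F_2) + \gamma(v \to \lambda) \\
\delta(F_1, F_2 - w) + \gamma(\lambda \to w) \\
\delta(F_1(v), F_2(w)) + \delta(F_1 - T_1(v), F_2 - T_2(w)) + \gamma(v \to w)
\end{cases}
\end{aligned}
$$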

Proof. The first three equations are trivially true. To show the last equation consider a minimum cost
mapping $M$ between $F_1$ and $F_2$. There are three possibilities for $v$ and $w$.

Case 1: $v$ is not touched by a line. Then $(v, \lambda) \in M$ and the first case of the last equation applies.

Case 2: $w$ is not touched by a line. Then $(\lambda, w) \in M$ and the second case of the last equation applies.

Case 3: $v$ and $w$ are both touched by lines. We show that this implies $(v, w) \in M$. Suppose $(v, h)$ and
$(k, w)$ are in $M$. If $v$ is to the right of $k$ then $h$ must be to the right of $w$ by the sibling condition. If $v$ is a
proper ancestor of $k$ then $h$ must be a proper ancestor of $w$ by the ancestor condition. Both of these
cases are impossible since $v$ and $w$ are the rightmost roots and hence $(v, w) \in M$. By the definition of
mappings the equation follows. □

1 Since the edit distance is symmetric this bound is in fact $O(\min(|T_1|^2|T_2| \log|T_2|, |T_2|^2|T_1| \log|T_1|))$. For brevity we will use
the short version.

Lemma 1 suggests a dynamic programming algorithm. The value of $\delta(F_1, F_2)$ depends on a constant number
of subproblems of smaller size. Hence, we can compute $\delta(F_1, F_2)$ by computing $\delta(S_1, S_2)$ for all pairs of
subproblems $S_1$ and $S_2$ in order of increasing size. Each new subproblem can be computed in constant
time. Hence, the time complexity is bounded by the number of subproblems of $F_1$ times the number of
subproblems of $F_2$.
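
The recursion of Lemma 1 translates almost literally into a memoized function. The sketch below is a hedged illustration, not an efficient implementation: it assumes unit costs, encodes a forest as a tuple of trees where each tree is a pair `(label, forest of children)`, and memoizes on the forests themselves rather than on index pairs, so a subproblem is not handled in constant time as in the analysis above. The names `forest_distance` and `unit_cost` are our own.

```python
from functools import lru_cache

LAMBDA = None  # stand-in for the blank symbol


def unit_cost(a, b):
    return 0 if a == b else 1


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between ordered forests f1 and f2 under unit costs."""
    if not f1 and not f2:
        return 0
    if not f2:
        v_label, v_children = f1[-1]
        # F1 - v: the rightmost root v is replaced by its children.
        return forest_distance(f1[:-1] + v_children, ()) + unit_cost(v_label, LAMBDA)
    if not f1:
        w_label, w_children = f2[-1]
        return forest_distance((), f2[:-1] + w_children) + unit_cost(LAMBDA, w_label)
    (v_label, v_children), (w_label, w_children) = f1[-1], f2[-1]
    return min(
        # delete v
        forest_distance(f1[:-1] + v_children, f2) + unit_cost(v_label, LAMBDA),
        # insert w
        forest_distance(f1, f2[:-1] + w_children) + unit_cost(LAMBDA, w_label),
        # map v to w: match the children forests and the remaining forests
        forest_distance(v_children, w_children)
        + forest_distance(f1[:-1], f2[:-1])
        + unit_cost(v_label, w_label),
    )
```

To compare two trees, each is wrapped in a one-tree forest, for instance `forest_distance((t1,), (t2,))`.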

To count the number of subproblems, define for a rooted, ordered forest $F$ the $(i, j)$-deleted subforest,
$0 \leq i + j \leq |F|$, as the forest obtained from $F$ by first deleting the rightmost root repeatedly $j$ times and
then, similarly, deleting the leftmost root $i$ times. We call the $(0, j)$-deleted and $(i, 0)$-deleted subforests, for
$0 \leq i, j \leq |F|$, the prefixes and the suffixes of $F$ respectively. The number of $(i, j)$-deleted subforests of $F$ is
$\sum_{k=0}^{|F|} k = O(|F|^2)$, since for each $i$ there are $|F| - i$ choices for $j$.

It is not hard to show that all the pairs of subproblems $S_1$ and $S_2$ that can be obtained by the recursion
of Lemma 1 are deleted subforests of $F_1$ and $F_2$. Hence, by the above discussion the time complexity is
bounded by $O(|F_1|^2|F_2|^2)$. In fact, fewer subproblems are needed, which we will show in the next sections.

2.3.2.2 Zhang and Shasha's Algorithm 

The following algorithm is due to Zhang and Shasha [ZS89]. Define the keyroots of a rooted, ordered tree T 
as follows: 

$$\mathrm{keyroots}(T) = \{\mathrm{root}(T)\} \cup \{v \in V(T) \mid v \text{ has a left sibling}\}$$

The special subforests of $T$ are the forests $F(v)$, where $v \in \mathrm{keyroots}(T)$. The relevant subproblems of $T$ with
respect to the keyroots are the prefixes of all special subforests $F(v)$. In this section we refer to these as the
relevant subproblems.

Lemma 2 For each node $v \in V(T)$, $F(v)$ is a relevant subproblem.

It is easy to see that, in fact, the subproblems that can occur in the above recursion are either subforests of
the form $F(v)$, where $v \in V(T)$, or prefixes of a special subforest of $T$. Hence, it follows by Lemma 2 and
the definition of a relevant subproblem, that to compute $\delta(F_1, F_2)$ it is sufficient to compute $\delta(S_1, S_2)$ for all
relevant subproblems $S_1$ and $S_2$ of $T_1$ and $T_2$ respectively.

The relevant subproblems of a tree $T$ can be counted as follows. For a node $v \in V(T)$ define the collapsed
depth of $v$, $\mathrm{cdepth}(v)$, as the number of keyroot ancestors of $v$. Also, define $\mathrm{cdepth}(T)$ as the maximum
collapsed depth of all nodes $v \in V(T)$.
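
Keyroots and collapsed depths are easy to compute in a single traversal. The sketch below assumes, for illustration, that a tree is given as a dictionary `children` mapping a node to its ordered child list, and the function names are hypothetical. Following the definition of ancestor in Section 2.2, a keyroot counts as its own ancestor.

```python
def keyroots(children, root):
    """The root together with every node that has a left sibling."""
    keys = {root}
    for kids in children.values():
        keys.update(kids[1:])   # all children except the leftmost one
    return keys


def collapsed_depths(children, root):
    """cdepth(v) for every node v: the number of keyroot ancestors of v."""
    keys = keyroots(children, root)
    cdepth = {}
    stack = [(root, 0)]
    while stack:
        v, inherited = stack.pop()
        cdepth[v] = inherited + (1 if v in keys else 0)
        for c in children.get(v, []):
            stack.append((c, cdepth[v]))
    return cdepth
```

The maximum value in the returned dictionary is $\mathrm{cdepth}(T)$.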

Lemma 3 For an ordered tree $T$ the number of relevant subproblems, with respect to the keyroots is bounded
by $O(|T| \, \mathrm{cdepth}(T))$.

Proof. The relevant subproblems can be counted using the following expression:

$$\sum_{v \in \mathrm{keyroots}(T)} |F(v)| < \sum_{v \in \mathrm{keyroots}(T)} |T(v)| = \sum_{v \in V(T)} \mathrm{cdepth}(v) \leq |T| \, \mathrm{cdepth}(T)$$

Since the number of prefixes of a subforest $F(v)$ is $|F(v)|$ the first sum counts the number of relevant
subproblems of $F(v)$. To prove the first equality note that for each node $v$ the number of special subforests
containing $v$ is the collapsed depth of $v$. Hence, $v$ contributes the same amount to the left and right side.
The other equalities/inequalities follow immediately. □






Lemma 4 For a tree $T$, $\mathrm{cdepth}(T) \leq \min\{\mathrm{depth}(T), \mathrm{leaves}(T)\}$.

Thus, using dynamic programming the problem can be solved in $O(|T_1||T_2| \min\{D_1, L_1\} \min\{D_2, L_2\})$ time
and space. To improve the space complexity we carefully compute the subproblems in a specific order and
discard some of the intermediate results. Throughout the algorithm we maintain a table called the permanent
table storing the distances $\delta(F_1(v), F_2(w))$, $v \in V(F_1)$ and $w \in V(F_2)$, as they are computed. This uses
$O(|F_1||F_2|)$ space. When the distances of all special subforests of $F_1$ and $F_2$ are available in the permanent
table, we compute the distance between all prefixes of $F_1$ and $F_2$ in order of increasing size and store these
in a table called the temporary table. The values of the temporary table that are distances between special
subforests are copied to the permanent table and the rest of the values are discarded. Hence, the temporary
table also uses at most $O(|F_1||F_2|)$ space. By Lemma 1 it is easy to see that all values needed to compute
$\delta(F_1, F_2)$ are available. Hence,

Theorem 1 ([ZS89]) For ordered trees $T_1$ and $T_2$ the tree edit distance problem can be solved in time $O(|T_1||T_2| \min\{D_1, L_1\} \min\{D_2, L_2\})$ and space $O(|T_1||T_2|)$.

2.3.2.3 Klein's Algorithm 

In the worst case, that is for trees with linear depth and a linear number of leaves, Zhang and Shasha's algorithm of the previous section still requires $O(|T_1|^2|T_2|^2)$ time, as does the simple algorithm. In [Kle98] Klein obtained a better worst case time bound of $O(|T_1|^2|T_2| \log |T_2|)$. The reported space complexity of the algorithm is $O(|T_1|^2|T_2| \log |T_2|)$, which is significantly worse than the algorithm of Zhang and Shasha. However, according to Klein [Kle02] the space of this algorithm can also be improved to $O(|T_1||T_2|)$.

The algorithm is based on an extension of the recursion in Lemma 1. The main idea is to consider all of the $O(|T_1|^2)$ deleted subforests of $T_1$ but only $O(|T_2| \log |T_2|)$ deleted subforests of $T_2$. In total the worst case number of subproblems is thus reduced to the desired bound above.

A key concept in the algorithm is the decomposition of a rooted tree $T$ into disjoint paths called heavy paths. This technique was introduced by Harel and Tarjan [HT84]. We define the size of a node $v \in V(T)$ as $|T(v)|$. We classify each node of $T$ as either heavy or light as follows. The root is light. For each internal node $v$ we pick a child $u$ of $v$ of maximum size among the children of $v$ and classify $u$ as heavy. The remaining children are light. We call an edge to a light child a light edge, and an edge to a heavy child a heavy edge. The light depth of a node $v$, $\mathrm{ldepth}(v)$, is the number of light edges on the path from $v$ to the root.

Lemma 5 ([HT84]) For any tree $T$ and any $v \in V(T)$, $\mathrm{ldepth}(v) \le \log |T| + O(1)$.

By removing the light edges T is partitioned into heavy paths. 
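
The classification into heavy and light nodes can be sketched in a few lines of Python. The dictionary encoding of the tree and the names `heavy_light` and `heavy_paths` are assumptions made for illustration. Ties between children of equal size are broken in favour of the leftmost child, which the definition above permits.

```python
import math


def heavy_light(children, root):
    """Return (size, heavy, ldepth) for a rooted tree.

    `children` maps a node to the list of its children (hypothetical encoding).
    `heavy[v]` is the heavy child of the internal node v.
    """
    order, stack = [], [root]
    while stack:                          # a parent is visited before its children
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    size = {}
    for v in reversed(order):             # children are processed before parents
        size[v] = 1 + sum(size[c] for c in children.get(v, []))

    heavy, ldepth = {}, {root: 0}
    for v in order:
        kids = children.get(v, [])
        if not kids:
            continue
        heavy[v] = max(kids, key=size.__getitem__)   # first child of maximum size
        for c in kids:
            ldepth[c] = ldepth[v] + (0 if c == heavy[v] else 1)
    return size, heavy, ldepth


def heavy_paths(children, root):
    """Partition the nodes into heavy paths, each starting at a light node."""
    size, heavy, _ = heavy_light(children, root)
    heavy_nodes = set(heavy.values())
    paths = []
    for v in size:
        if v not in heavy_nodes:          # light node: start of a heavy path
            path = [v]
            while path[-1] in heavy:
                path.append(heavy[path[-1]])
            paths.append(path)
    return paths


# Complete binary tree on 15 nodes, node i having children 2i+1 and 2i+2.
binary = {i: [2 * i + 1, 2 * i + 2] for i in range(7)}
size, heavy, ldepth = heavy_light(binary, 0)
paths = heavy_paths(binary, 0)
```

Each light edge leads to a child whose subtree is at most half the size of its parent's subtree, which is why the light depth stays logarithmic in this example.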

We define the relevant subproblems of $T$ with respect to the light nodes below. We will refer to these as relevant subproblems in this section. First fix a heavy path decomposition of $T$. For a node $v$ in $T$ we recursively define the relevant subproblems of $F(v)$ as follows: $F(v)$ is relevant. If $v$ is not a leaf, let $u$ be the heavy child of $v$ and let $l$ and $r$ be the number of nodes to the left and to the right of $u$ in $F(v)$ respectively. Then, the $(i, 0)$-deleted subforests of $F(v)$, $0 \le i \le l$, and the $(l, j)$-deleted subforests of $F(v)$, $0 \le j \le r$, are relevant subproblems. Recursively, all relevant subproblems of $F(u)$ are relevant.

The relevant subproblems of $T$ with respect to the light nodes is the union of all relevant subproblems of $F(v)$ where $v \in V(T)$ is a light node.

Lemma 6 For an ordered tree $T$ the number of relevant subproblems with respect to the light nodes is bounded by $O(|T|\,\mathrm{ldepth}(T))$.

Proof. Follows by the same calculation as in the proof of Lemma 3. 

□ 

Also note that Lemma 2 still holds with this new definition of relevant subproblems. Let $S$ be a relevant subproblem of $T$ and let $v_l$ and $v_r$ denote the leftmost and rightmost root of $S$ respectively. The difference node of $S$ is either $v_r$ if $S - v_r$ is relevant or $v_l$ if $S - v_l$ is relevant. The recursion of Lemma 1 compares the rightmost roots. Clearly, we can also choose to compare the leftmost roots resulting in a new recursion, which we will refer to as the dual of Lemma 1. Depending on which recursion we use, different subproblems occur. We now give a modified dynamic programming algorithm for calculating the tree edit distance. Let $S_1$ be a deleted subforest of $T_1$ and let $S_2$ be a relevant subproblem of $T_2$. Let $d$ be the difference node of $S_2$. We compute $\delta(S_1, S_2)$ as follows. There are two cases to consider:

1. If $d$ is the rightmost root of $S_2$ compare the rightmost roots of $S_1$ and $S_2$ using Lemma 1.

2. If $d$ is the leftmost root of $S_2$ compare the leftmost roots of $S_1$ and $S_2$ using the dual of Lemma 1.

It is easy to show that in both cases the resulting smaller subproblems of $S_1$ will all be deleted subforests of $T_1$ and the smaller subproblems of $S_2$ will all be relevant subproblems of $T_2$. Using a similar dynamic programming technique as in the algorithm of Zhang and Shasha we obtain the following:

Theorem 2 ([Kle98]) For ordered trees $T_1$ and $T_2$ the tree edit distance problem can be solved in time and space $O(|T_1|^2|T_2| \log |T_2|)$.

Klein [Kle98] also showed that his algorithm can be extended within the same time and space bounds to the unrooted ordered edit distance problem between $T_1$ and $T_2$, defined as the minimum edit distance between $T_1$ and $T_2$ over all possible roots of $T_1$ and $T_2$.

2.3.3 General Unordered Edit Distance 

In the following section we survey the unordered edit distance problem. This problem has been shown to be NP-complete [ZSS92, Zha89, ZSS91] even for binary trees with a label alphabet of size 2. The reduction is from the Exact Cover by 3-Sets problem [GJ79]. Subsequently, the problem was shown to be MAX SNP-hard [ZJ94]. Hence, unless P=NP there is no PTAS for the problem [ALM+98]. It was shown in [ZSS92] that for special cases of the problem polynomial time algorithms exist. If $T_2$ has one leaf, i.e., $T_2$ is a sequence, the problem can be solved in $O(|T_1||T_2|)$ time. More generally, there is an algorithm running in time $O(|T_1||T_2| + L_2!\,3^{L_2}(L_2^3 + D_2^2)|T_1|)$. Hence, if the number of leaves in $T_2$ is logarithmic the problem can be solved in polynomial time.

2.3.4 Constrained Edit Distance 

The fact that the general edit distance problem is difficult to solve has led to the study of restricted versions of the problem. In [Zha95, Zha96a] Zhang introduced the constrained edit distance, denoted by $\delta_c$, which is defined as an edit distance under the restriction that disjoint subtrees should be mapped to disjoint subtrees. Formally, $\delta_c(T_1, T_2)$ is defined as a minimum cost mapping $(M_c, T_1, T_2)$ satisfying the additional constraint that for all $(v_1, w_1), (v_2, w_2), (v_3, w_3) \in M_c$:

• $\mathrm{nca}(v_1, v_2)$ is a proper ancestor of $v_3$ iff $\mathrm{nca}(w_1, w_2)$ is a proper ancestor of $w_3$.

According to [LST01], Richter [Ric97b] independently introduced the structure respecting edit distance $\delta_s$. Similar to the constrained edit distance, $\delta_s(T_1, T_2)$ is defined as a minimum cost mapping $(M_s, T_1, T_2)$ satisfying the additional constraint that for all $(v_1, w_1), (v_2, w_2), (v_3, w_3) \in M_s$ such that none of $v_1$, $v_2$, and $v_3$ is an ancestor of the others,

• $\mathrm{nca}(v_1, v_2) = \mathrm{nca}(v_1, v_3)$ iff $\mathrm{nca}(w_1, w_2) = \mathrm{nca}(w_1, w_3)$.

It is straightforward to show that both of these notions of edit distance are equivalent. Henceforth, we will refer to them simply as the constrained edit distance. As an example consider the mappings of Figure 2.4. (a) is a constrained mapping since $\mathrm{nca}(v_1, v_2) \ne \mathrm{nca}(v_1, v_3)$ and $\mathrm{nca}(w_1, w_2) \ne \mathrm{nca}(w_1, w_3)$. (b) is not constrained since $\mathrm{nca}(v_1, v_2) = v_4 \ne \mathrm{nca}(v_1, v_3) = v_5$, while $\mathrm{nca}(w_1, w_2) = w_4 = \mathrm{nca}(w_1, w_3)$. (c) is not constrained since $\mathrm{nca}(v_1, v_3) = v_5 = \mathrm{nca}(v_2, v_3)$, while $\mathrm{nca}(w_1, w_3) = w_5 \ne \mathrm{nca}(w_2, w_3) = w_4$.

Figure 2.4: (a) A mapping which is constrained and less-constrained. (b) A mapping which is less-constrained but not constrained. (c) A mapping which is neither constrained nor less-constrained.

In [Zha95, Zha96a] Zhang presents algorithms for computing minimum cost constrained mappings. For the ordered case he gives an algorithm using $O(|T_1||T_2|)$ time and for the unordered case he obtains a running time of $O(|T_1||T_2|(I_1 + I_2) \log(I_1 + I_2))$. Both use space $O(|T_1||T_2|)$. The idea in both algorithms is similar. Due to the restriction on the mappings fewer subproblems need to be considered and a faster dynamic programming algorithm is obtained. In the ordered case the key observation is a reduction to the string edit distance problem. For the unordered case the corresponding reduction is to a maximum matching problem. Using an efficient algorithm for computing a minimum cost maximum flow Zhang obtains the time complexity above. Richter presented an algorithm for the ordered constrained edit distance problem, which uses $O(|T_1||T_2|I_1I_2)$ time and $O(|T_1|D_2I_2)$ space. Hence, for small degree, low depth trees this algorithm gives a space improvement over the algorithm of Zhang.

Recently, Lu et al. [LST01] introduced the less-constrained edit distance, $\delta_l$, which relaxes the constrained mapping. The requirement here is that for all $(v_1, w_1), (v_2, w_2), (v_3, w_3) \in M_l$ such that none of $v_1$, $v_2$, and $v_3$ is an ancestor of the others, $\mathrm{depth}(\mathrm{nca}(v_1, v_2)) \ge \mathrm{depth}(\mathrm{nca}(v_1, v_3))$ and $\mathrm{nca}(v_1, v_3) = \mathrm{nca}(v_2, v_3)$ if and only if $\mathrm{depth}(\mathrm{nca}(w_1, w_2)) \ge \mathrm{depth}(\mathrm{nca}(w_1, w_3))$ and $\mathrm{nca}(w_1, w_3) = \mathrm{nca}(w_2, w_3)$.

For example, consider the mappings in Figure 2.4. (a) is less-constrained because it is constrained. (b) is not a constrained mapping, however the mapping is less-constrained since $\mathrm{depth}(\mathrm{nca}(v_1, v_2)) > \mathrm{depth}(\mathrm{nca}(v_1, v_3))$, $\mathrm{nca}(v_1, v_3) = \mathrm{nca}(v_2, v_3)$, $\mathrm{nca}(w_1, w_2) = \mathrm{nca}(w_1, w_3)$, and $\mathrm{nca}(w_1, w_3) = \mathrm{nca}(w_2, w_3)$. (c) is not a less-constrained mapping since $\mathrm{depth}(\mathrm{nca}(v_1, v_2)) > \mathrm{depth}(\mathrm{nca}(v_1, v_3))$ and $\mathrm{nca}(v_1, v_3) = \mathrm{nca}(v_2, v_3)$, while $\mathrm{nca}(w_1, w_3) \ne \mathrm{nca}(w_2, w_3)$.

In the paper [LST01] an algorithm for the ordered version of the less-constrained edit distance problem using $O(|T_1||T_2|I_1^3I_2^3(I_1 + I_2))$ time and space is presented. For the unordered version, unlike the constrained edit distance problem, it is shown that the problem is NP-complete. The reduction used is similar to the one for the unordered edit distance problem. It is also reported that the problem is MAX SNP-hard. Furthermore, it is shown that there is no absolute approximation algorithm² for the unordered less-constrained edit distance problem unless P=NP.

² An approximation algorithm $A$ is absolute if there exists a constant $c > 0$ such that for every instance $I$, $|A(I) - \mathrm{OPT}(I)| \le c$, where $A(I)$ and $\mathrm{OPT}(I)$ are the approximate and optimal solutions of $I$ respectively [Mot92].

2.3.5 Other Variants



In this section we survey results for other variants of edit distance. Let $T_1$ and $T_2$ be rooted trees. The unit cost edit distance between $T_1$ and $T_2$ is defined as the minimum number of edit operations needed to turn $T_1$ into $T_2$. In [SZ90] the ordered version of this problem is considered and a fast algorithm is presented. If $u$ is the unit cost edit distance between $T_1$ and $T_2$ the algorithm runs in $O(u^2 \min\{|T_1|, |T_2|\} \min\{L_1, L_2\})$ time. The algorithm uses techniques from Ukkonen [Ukk85b] and Landau and Vishkin [LV89].

In [Sel77] Selkow considered an edit distance problem where insertions and deletions are restricted to leaves of the trees. This edit distance is sometimes referred to as the 1-degree edit distance. He gave a simple algorithm using $O(|T_1||T_2|)$ time and space. Another edit distance measure where edit operations work on subtrees instead of nodes was given by Lu [Lu79]. A similar edit distance was given by Tanaka in [TT88, Tan95]. A short description of Lu's algorithm can be found in [SZ97].
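
To make the leaf-restricted measure concrete, the following Python sketch computes a distance in the spirit of [Sel77]. It assumes unit costs and that the two roots are always matched, possibly with a relabel. Removing or adding a whole subtree leaf by leaf then costs its size. The tuple encoding `(label, children)` and the names `size`, `leaf`, and `leaf_edit_distance` are hypothetical, and the sketch is not claimed to be Selkow's original formulation.

```python
from functools import lru_cache


def size(tree):
    """Number of nodes of a tree encoded as (label, tuple_of_children)."""
    return 1 + sum(size(c) for c in tree[1])


def leaf(label):
    """Convenience constructor for a single-node tree."""
    return (label, ())


@lru_cache(maxsize=None)
def leaf_edit_distance(t1, t2):
    """Unit-cost distance where only leaves may be inserted or deleted."""
    (a, kids1), (b, kids2) = t1, t2
    m, n = len(kids1), len(kids2)
    # String-edit-style table over the two child sequences.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])        # delete a subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])        # insert a subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),
                d[i][j - 1] + size(kids2[j - 1]),
                d[i - 1][j - 1] + leaf_edit_distance(kids1[i - 1], kids2[j - 1]),
            )
    return (0 if a == b else 1) + d[m][n]


pair = ("a", (leaf("b"), leaf("c")))
chain = ("a", (("b", (leaf("c"),)),))
short = ("a", (leaf("c"),))
```

For `chain` and `short` the internal node b cannot be deleted directly, so the sketch deletes the leaf c and relabels b instead.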



2.4 Tree Alignment Distance 

In this section we consider the alignment distance problem. Let $T_1$ and $T_2$ be rooted, labeled trees and let $\gamma$ be a metric cost function on pairs of labels as defined in Section 2.2. An alignment $A$ of $T_1$ and $T_2$ is obtained by first inserting nodes labeled with $\lambda$ (called spaces) into $T_1$ and $T_2$ so that they become isomorphic when labels are ignored, and then overlaying the first augmented tree on the other one. The cost of a pair of opposing labels in $A$ is given by $\gamma$. The cost of $A$ is the sum of costs of all opposing labels in $A$. An optimal alignment of $T_1$ and $T_2$ is an alignment of $T_1$ and $T_2$ of minimum cost. We denote this cost by $\alpha(T_1, T_2)$. Figure 2.5 shows an example (from [JWZ95]) of an ordered alignment.




Figure 2.5: (a) Tree $T_1$. (b) Tree $T_2$. (c) An alignment of $T_1$ and $T_2$.



The tree alignment distance problem is a special case of the tree editing problem. In fact, it corresponds to a restricted edit distance where all insertions must be performed before any deletions. Hence, $\delta(T_1, T_2) \le \alpha(T_1, T_2)$. For instance, assume that all edit operations have cost 1 and consider the example in Figure 2.5. The optimal sequence of edit operations is achieved by deleting the node labeled $e$ and then inserting the node labeled $f$. Hence, the edit distance is 2. The optimal alignment, however, is the tree depicted in (c) with a value of 4. Additionally, it also follows that the alignment distance does not satisfy the triangle inequality and hence it is not a distance metric. For instance, in Figure 2.5 if $T_3$ is $T_1$ where the node labeled $e$ is deleted, then $\alpha(T_1, T_3) + \alpha(T_3, T_2) = 2 < 4 = \alpha(T_1, T_2)$.

It is a well known fact that edit and alignment distance are equivalent in terms of complexity for sequences, see, e.g., Gusfield [Gus97]. However, for trees this is not true, which we will show in the following sections. In Section 2.4.1 and Section 2.4.2 we survey the results for the ordered and unordered tree alignment distance problem respectively.

2.4.1 Ordered Tree Alignment Distance



In this section we consider the ordered tree alignment distance problem. Let $T_1$ and $T_2$ be two rooted, ordered and labeled trees. The ordered tree alignment distance problem was introduced by Jiang et al. in [JWZ95]. The algorithm presented there uses $O(|T_1||T_2|(I_1 + I_2)^2)$ time and $O(|T_1||T_2|(I_1 + I_2))$ space. Hence, for small degree trees, this algorithm is in general faster than the best known algorithm for the edit distance. We present this algorithm in detail in the next section. Recently, in [JL01], a new algorithm designed for similar trees was proposed. Specifically, if there is an optimal alignment of $T_1$ and $T_2$ using at most $s$ spaces the algorithm computes the alignment in time $O((|T_1| + |T_2|) \log(|T_1| + |T_2|)(I_1 + I_2)^4 s^2)$. This algorithm works in a way similar to the fast algorithms for comparing similar sequences, see, e.g., Section 3.3.4 in [SM97]. The main idea is to speed up the algorithm of Jiang et al. by only considering subtrees of $T_1$ and $T_2$ whose sizes differ by at most $O(s)$.

2.4.1.1 Jiang, Wang, and Zhang's Algorithm 

In this section we present the algorithm of Jiang et al. [JWZ95]. We only show how to compute the alignment distance. The corresponding alignment can easily be constructed within the same complexity bounds. Let $\gamma$ be a metric cost function on the labels. For simplicity, we will refer to nodes instead of labels, that is, we will use $(v, w)$ for nodes $v$ and $w$ to mean $(\mathrm{label}(v), \mathrm{label}(w))$. Here, $v$ or $w$ may be $\lambda$. We extend the definition of $\alpha$ to include alignments of forests, that is, $\alpha(F_1, F_2)$ denotes the cost of an optimal alignment of forests $F_1$ and $F_2$.

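Lemma 7 Let $v \in V(T_1)$ and $w \in V(T_2)$ with children $v_1, \ldots, v_i$ and $w_1, \ldots, w_j$ respectively. Then,

$$\begin{aligned}
\alpha(\theta, \theta) &= 0 \\
\alpha(T_1(v), \theta) &= \alpha(F_1(v), \theta) + \gamma(v, \lambda) \\
\alpha(\theta, T_2(w)) &= \alpha(\theta, F_2(w)) + \gamma(\lambda, w) \\
\alpha(F_1(v), \theta) &= \sum_{k=1}^{i} \alpha(T_1(v_k), \theta) \\
\alpha(\theta, F_2(w)) &= \sum_{k=1}^{j} \alpha(\theta, T_2(w_k))
\end{aligned}$$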

Lemma 8 Let $v \in V(T_1)$ and $w \in V(T_2)$ with children $v_1, \ldots, v_i$ and $w_1, \ldots, w_j$ respectively. Then,

$$\alpha(T_1(v), T_2(w)) = \min \begin{cases}
\alpha(F_1(v), F_2(w)) + \gamma(v, w) \\
\alpha(\theta, T_2(w)) + \min_{1 \le r \le j} \{\alpha(T_1(v), T_2(w_r)) - \alpha(\theta, T_2(w_r))\} \\
\alpha(T_1(v), \theta) + \min_{1 \le r \le i} \{\alpha(T_1(v_r), T_2(w)) - \alpha(T_1(v_r), \theta)\}
\end{cases}$$

Proof. Consider an optimal alignment $A$ of $T_1(v)$ and $T_2(w)$. There are four cases: (1) $(v, w)$ is a label in $A$, (2) $(v, \lambda)$ and $(k, w)$ are labels in $A$ for some $k \in V(T_1)$, (3) $(\lambda, w)$ and $(v, h)$ are labels in $A$ for some $h \in V(T_2)$, or (4) $(v, \lambda)$ and $(\lambda, w)$ are in $A$. Case (4) need not be considered since the two nodes can be deleted and replaced by the single node $(v, w)$ as the new root. The cost of the resulting alignment is by the triangle inequality at least as small.

Case 1: The root of $A$ is labeled by $(v, w)$. Hence,

$$\alpha(T_1(v), T_2(w)) = \alpha(F_1(v), F_2(w)) + \gamma(v, w)$$

Case 2: The root of $A$ is labeled by $(v, \lambda)$. Hence, $k \in V(T_1(v_r))$ for some $1 \le r \le i$. It follows that

$$\alpha(T_1(v), T_2(w)) = \alpha(T_1(v), \theta) + \min_{1 \le r \le i} \{\alpha(T_1(v_r), T_2(w)) - \alpha(T_1(v_r), \theta)\}$$

Case 3: Symmetric to case 2. □

Lemma 9 Let $v \in V(T_1)$ and $w \in V(T_2)$ with children $v_1, \ldots, v_i$ and $w_1, \ldots, w_j$ respectively. For any $s, t$ such that $1 \le s \le i$ and $1 \le t \le j$,

$$\alpha(F_1(v_1, v_s), F_2(w_1, w_t)) = \min \begin{cases}
\alpha(F_1(v_1, v_{s-1}), F_2(w_1, w_{t-1})) + \alpha(T_1(v_s), T_2(w_t)) \\
\alpha(F_1(v_1, v_{s-1}), F_2(w_1, w_t)) + \alpha(T_1(v_s), \theta) \\
\alpha(F_1(v_1, v_s), F_2(w_1, w_{t-1})) + \alpha(\theta, T_2(w_t)) \\
\gamma(\lambda, w_t) + \min_{1 \le k < s} \{\alpha(F_1(v_1, v_{k-1}), F_2(w_1, w_{t-1})) + \alpha(F_1(v_k, v_s), F_2(w_t))\} \\
\gamma(v_s, \lambda) + \min_{1 \le k < t} \{\alpha(F_1(v_1, v_{s-1}), F_2(w_1, w_{k-1})) + \alpha(F_1(v_s), F_2(w_k, w_t))\}
\end{cases}$$

Proof. Consider an optimal alignment $A$ of $F_1(v_1, v_s)$ and $F_2(w_1, w_t)$. The root of the rightmost tree in $A$ is labeled either $(v_s, w_t)$, $(v_s, \lambda)$ or $(\lambda, w_t)$.

Case 1: The label is $(v_s, w_t)$. Then the rightmost tree of $A$ must be an optimal alignment of $T_1(v_s)$ and $T_2(w_t)$. Hence,

$$\alpha(F_1(v_1, v_s), F_2(w_1, w_t)) = \alpha(F_1(v_1, v_{s-1}), F_2(w_1, w_{t-1})) + \alpha(T_1(v_s), T_2(w_t)).$$

Case 2: The label is $(v_s, \lambda)$. Then $T_1(v_s)$ is aligned with a subforest $F_2(w_{t-k+1}, w_t)$, where $0 \le k \le t$. The following subcases can occur:

2.1 ($k = 0$). $T_1(v_s)$ is aligned with $F_2(w_{t-k+1}, w_t) = \theta$. Hence,

$$\alpha(F_1(v_1, v_s), F_2(w_1, w_t)) = \alpha(F_1(v_1, v_{s-1}), F_2(w_1, w_t)) + \alpha(T_1(v_s), \theta).$$

2.2 ($k = 1$). $T_1(v_s)$ is aligned with $F_2(w_{t-k+1}, w_t) = T_2(w_t)$. Similar to case 1.

2.3 ($k \ge 2$). The most general case. It is easy to see that:

$$\alpha(F_1(v_1, v_s), F_2(w_1, w_t)) = \gamma(v_s, \lambda) + \min_{1 \le r < t} \{\alpha(F_1(v_1, v_{s-1}), F_2(w_1, w_{r-1})) + \alpha(F_1(v_s), F_2(w_r, w_t))\}.$$

Case 3: The label is $(\lambda, w_t)$. Symmetric to case 2. □

This recursion can be used to construct a bottom-up dynamic programming algorithm. Consider a fixed pair of nodes $v$ and $w$ with children $v_1, \ldots, v_i$ and $w_1, \ldots, w_j$ respectively. We need to compute the values $\alpha(F_1(v_h, v_k), F_2(w))$ for all $1 \le h \le k \le i$, and $\alpha(F_1(v), F_2(w_h, w_k))$ for all $1 \le h \le k \le j$. That is, we need to compute the optimal alignment of $F_2(w)$ with each subforest of $F_1(v)$ and, on the other hand, compute the optimal alignment of $F_1(v)$ with each subforest of $F_2(w)$. For any $s$ and $t$, $1 \le s \le i$ and $1 \le t \le j$, define the set:

$$A_{s,t} = \{\alpha(F_1(v_s, v_p), F_2(w_t, w_q)) \mid s \le p \le i,\ t \le q \le j\}$$

To compute the alignments described above we need to compute $A_{s,1}$ and $A_{1,t}$ for all $1 \le s \le i$ and $1 \le t \le j$. Assuming that values for smaller subproblems are known it is not hard to show that $A_{s,t}$ can be computed, using Lemma 9, in time $O((i - s) \cdot (j - t) \cdot (i - s + j - t)) = O(ij(i + j))$. Hence, the time to compute the $(i + j)$ subproblems, $A_{s,1}$ and $A_{1,t}$, $1 \le s \le i$ and $1 \le t \le j$, is bounded by $O(ij(i + j)^2)$. It follows that the total time needed for all nodes $v$ and $w$ is bounded by:

$$\begin{aligned}
\sum_{v \in V(T_1)} \sum_{w \in V(T_2)} O(\deg(v) \deg(w) (\deg(v) + \deg(w))^2)
&\le \sum_{v \in V(T_1)} \sum_{w \in V(T_2)} O(\deg(v) \deg(w) (\deg(T_1) + \deg(T_2))^2) \\
&\le O\Big((I_1 + I_2)^2 \sum_{v \in V(T_1)} \sum_{w \in V(T_2)} \deg(v) \deg(w)\Big) \\
&\le O(|T_1||T_2|(I_1 + I_2)^2)
\end{aligned}$$

In summary, we have shown the following theorem.

Theorem 3 ([JWZ95]) For ordered trees $T_1$ and $T_2$, the tree alignment distance problem can be solved in $O(|T_1||T_2|(I_1 + I_2)^2)$ time and $O(|T_1||T_2|(I_1 + I_2))$ space.
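
The recursions of Lemmas 7 to 9 can be exercised with a small memoized Python sketch. It is not the bottom-up dynamic program of [JWZ95] and makes no claim to the bounds of Theorem 3. It simply tries the three possible root labels of the rightmost aligned tree. The unit cost function `gamma`, the use of `None` for the space label $\lambda$, the tuple encoding `(label, children)`, and the function names are assumptions made for illustration. The example trees are our reading of Figure 2.5 from the surrounding discussion, namely $T_1 = a(e(b, c), d)$ and $T_2 = a(b, f(c, d))$, which is also an assumption since the figure is not reproduced here.

```python
from functools import lru_cache


def gamma(a, b):
    """Assumed unit cost on label pairs; None stands for the space label."""
    return 0 if a == b else 1


@lru_cache(maxsize=None)
def align_tree(t1, t2):
    """Cost of an optimal alignment of two trees (label, tuple_of_children)."""
    (a, kids1), (b, kids2) = t1, t2
    return min(
        gamma(a, b) + align_forest(kids1, kids2),        # root labeled (v, w)
        gamma(None, b) + align_forest((t1,), kids2),     # root labeled (lambda, w)
        gamma(a, None) + align_forest(kids1, (t2,)),     # root labeled (v, lambda)
    )


@lru_cache(maxsize=None)
def align_forest(f, g):
    """Cost of an optimal alignment of two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:                                            # pair every node of f with a space
        v = f[-1]
        return align_forest(f[:-1], ()) + gamma(v[0], None) + align_forest(v[1], ())
    if not f:                                            # pair every node of g with a space
        w = g[-1]
        return align_forest((), g[:-1]) + gamma(None, w[0]) + align_forest((), w[1])
    v, w = f[-1], g[-1]
    # Rightmost aligned tree pairs T1(v) with T2(w).
    best = align_forest(f[:-1], g[:-1]) + align_tree(v, w)
    # Rightmost root is (lambda, w): a suffix of f is aligned with the children of w.
    for k in range(len(f) + 1):
        best = min(best, gamma(None, w[0])
                   + align_forest(f[:k], g[:-1]) + align_forest(f[k:], w[1]))
    # Rightmost root is (v, lambda): a suffix of g is aligned with the children of v.
    for k in range(len(g) + 1):
        best = min(best, gamma(v[0], None)
                   + align_forest(f[:-1], g[:k]) + align_forest(v[1], g[k:]))
    return best


def leaf(label):
    return (label, ())


t1 = ("a", (("e", (leaf("b"), leaf("c"))), leaf("d")))
t2 = ("a", (leaf("b"), ("f", (leaf("c"), leaf("d")))))
t3 = ("a", (leaf("b"), leaf("c"), leaf("d")))           # t1 with the node e deleted
```

On these trees the sketch reproduces the values discussed in Section 2.4: going through `t3` costs 1 + 1, while aligning `t1` with `t2` directly costs 4.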

2.4.2 Unordered Tree Alignment Distance 

The algorithm presented above can be modified to handle the unordered version of the problem in a straightforward way [JWZ95]. If the trees have bounded degrees the algorithm still runs in $O(|T_1||T_2|)$ time. This should be seen in contrast to the edit distance problem which is MAX SNP-hard even if the trees have bounded degree. If one tree has arbitrary degree unordered alignment becomes NP-hard [JWZ95]. The reduction is, as for the edit distance problem, from the Exact Cover by 3-Sets problem [GJ79].

2.5 Tree Inclusion 

In this section we survey the tree inclusion problem. Let $T_1$ and $T_2$ be rooted, labeled trees. We say that $T_1$ is included in $T_2$ if there is a sequence of delete operations performed on $T_2$ which makes $T_2$ isomorphic to $T_1$. The tree inclusion problem is to decide if $T_1$ is included in $T_2$. Figure 2.6(a) shows an example of an ordered inclusion. The tree inclusion problem is a special case of the tree edit distance problem: If insertions all have cost 0 and all other operations have cost 1, then $T_1$ can be included in $T_2$ if and only if $\delta(T_1, T_2) = 0$. According to [Che98] the tree inclusion problem was initially introduced by Knuth [Knu69, exercise 2.3.2-22].
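
The definition by delete operations translates directly into a simple recursive test for the ordered case. The sketch below is a plain deletion-only recursion on the rightmost root, not one of the algorithms surveyed in the following sections, and it makes no claim about running time. The tuple encoding `(label, children)` and the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def forest_included(f, g):
    """True if the ordered forest f can be obtained from g by deleting nodes."""
    if not f:
        return True                       # delete everything that is left in g
    if not g:
        return False
    w = g[-1]                             # rightmost root of g
    # Option 1: delete w, so its children take its place among the roots.
    if forest_included(f, g[:-1] + w[1]):
        return True
    # Option 2: keep w, which must then become the rightmost root of f.
    v = f[-1]
    return (v[0] == w[0]
            and forest_included(v[1], w[1])
            and forest_included(f[:-1], g[:-1]))


def included(t1, t2):
    """True if tree t1 is included in tree t2 (ordered version)."""
    return forest_included((t1,), (t2,))


def leaf(label):
    return (label, ())


target = ("a", (("d", (leaf("b"),)), leaf("c")))
pattern = ("a", (leaf("b"), leaf("c")))
reversed_pattern = ("a", (leaf("c"), leaf("b")))
```

Here `pattern` is obtained from `target` by deleting the node labeled d, whereas `reversed_pattern` is not included because deletions preserve the left-to-right order.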

The rest of the section is organized as follows. In Section 2.5.1 we give some preliminaries and in Section 
2.5.2 and 2.5.3 we survey the known results on ordered and unordered tree inclusion respectively. 

2.5.1 Orderings and Embeddings 

Let $T$ be a labeled, ordered, and rooted tree. We define an ordering of the nodes of $T$ given by $v \prec v'$ iff $\mathrm{post}(v) < \mathrm{post}(v')$. Also, $v \preceq v'$ iff $v \prec v'$ or $v = v'$. Furthermore, we extend this ordering with two special nodes $\bot$ and $\top$ such that for all nodes $v \in V(T)$, $\bot \prec v \prec \top$. The left relatives, $\mathrm{lr}(v)$, of a node $v \in V(T)$ is the set of nodes that are to the left of $v$ and similarly the right relatives, $\mathrm{rr}(v)$, are the set of nodes that are to the right of $v$.

Let $T_1$ and $T_2$ be rooted labeled trees. We define an ordered embedding $(f, T_1, T_2)$ as an injective function $f : V(T_1) \to V(T_2)$ such that for all nodes $v, u \in V(T_1)$,

• $\mathrm{label}(v) = \mathrm{label}(f(v))$. (label preservation condition)

• $v$ is an ancestor of $u$ iff $f(v)$ is an ancestor of $f(u)$. (ancestor condition)

• $v$ is to the left of $u$ iff $f(v)$ is to the left of $f(u)$. (sibling condition)

Hence, embeddings are special cases of mappings (see Section 2.3.1). An unordered embedding is defined as above, but without the sibling condition. An embedding $(f, T_1, T_2)$ is root preserving if $f(\mathrm{root}(T_1)) = \mathrm{root}(T_2)$. Figure 2.6(b) shows an example of a root preserving embedding.
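
The three conditions can be verified mechanically for a candidate function $f$. The Python sketch below uses preorder and postorder numbers to decide the ancestor and left-of relations. The encoding of a tree as a triple `(children, label, root)` and the names `numbering` and `is_embedding` are assumptions made for illustration, and $f$ is assumed to map into the nodes of the second tree.

```python
def numbering(children, root):
    """Preorder and postorder numbers of every node of a rooted ordered tree."""
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre)
        for c in children.get(v, []):
            visit(c)
        post[v] = len(post)

    visit(root)
    return pre, post


def is_embedding(f, tree1, tree2, ordered=True):
    """Check the label, ancestor and (if ordered) sibling conditions for f."""
    children1, label1, root1 = tree1
    children2, label2, root2 = tree2
    pre1, post1 = numbering(children1, root1)
    pre2, post2 = numbering(children2, root2)
    if set(f) != set(pre1) or len(set(f.values())) != len(f):
        return False                       # f must be total on V(T1) and injective

    def ancestor(pre, post, a, b):         # a is a proper ancestor of b
        return pre[a] < pre[b] and post[a] > post[b]

    def left_of(pre, post, a, b):          # a is to the left of b
        return pre[a] < pre[b] and post[a] < post[b]

    for v in f:
        if label1[v] != label2[f[v]]:
            return False
        for u in f:
            if ancestor(pre1, post1, v, u) != ancestor(pre2, post2, f[v], f[u]):
                return False
            if ordered and left_of(pre1, post1, v, u) != left_of(pre2, post2, f[v], f[u]):
                return False
    return True


small = ({0: [1, 2]}, {0: "a", 1: "b", 2: "c"}, 0)
swapped = ({0: [1, 2]}, {0: "a", 1: "c", 2: "b"}, 0)
big = ({0: [1, 2], 1: [3]}, {0: "a", 1: "d", 2: "c", 3: "b"}, 0)
```

The map `{0: 0, 1: 3, 2: 2}` embeds `small` into `big` in the ordered sense, while `{0: 0, 1: 2, 2: 3}` embeds `swapped` into `big` only when the sibling condition is dropped.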

Figure 2.6: (a) The tree on the left is included in the tree on the right by deleting the nodes labeled d, a and c. (b) The embedding corresponding to (a).



2.5.2 Ordered Tree Inclusion 

Let $T_1$ and $T_2$ be rooted, ordered and labeled trees. The ordered tree inclusion problem has been the subject of much research. Kilpeläinen and Mannila [KM95a] (see also [Kil92]) presented the first polynomial time algorithm using $O(|T_1||T_2|)$ time and space. Most of the later improvements are refinements of this algorithm. We present this algorithm in detail in the next section. In [Kil92] a more space efficient version of the above was given using $O(|T_1|D_2)$ space. In [Ric97a] Richter gave an algorithm using $O(|\Sigma_{T_1}||T_2| + m_{T_1,T_2}D_2)$ time, where $\Sigma_{T_1}$ is the alphabet of the labels of $T_1$ and $m_{T_1,T_2}$ is the number of matches, defined as the number of pairs $(v, w) \in T_1 \times T_2$ such that $\mathrm{label}(v) = \mathrm{label}(w)$. Hence, if the number of matches is small the time complexity of this algorithm improves on the $O(|T_1||T_2|)$ algorithm. The space complexity of the algorithm is $O(|\Sigma_{T_1}||T_2| + m_{T_1,T_2})$. In [Che98] a more complex algorithm was presented using $O(L_1|T_2|)$ time and $O(L_1 \min\{D_2, L_2\})$ space. In [AS01] an efficient average case algorithm was given.

2.5.2.1 Kilpelainen and Mannila's Algorithm 

In this section we present the algorithm of Kilpeläinen and Mannila [KM95a] for the ordered tree inclusion problem. Let $T_1$ and $T_2$ be ordered labeled trees. Define $R(T_1, T_2)$ as the set of root-preserving embeddings of $T_1$ into $T_2$. We define $\rho(v, w)$, where $v \in V(T_1)$ and $w \in V(T_2)$:

$$\rho(v, w) = \min(\{w' \in \mathrm{rr}(w) \mid \exists f \in R(T_1(v), T_2(w'))\} \cup \{\top\})$$

Hence, $\rho(v, w)$ is the closest right relative of $w$ which has a root-preserving embedding of $T_1(v)$. Furthermore, if no such embedding exists $\rho(v, w)$ is $\top$. It is easy to see that, by definition, $T_1$ can be included in $T_2$ if and only if $\rho(v, \bot) \ne \top$, where $v$ is the root of $T_1$. The following lemma shows how to search for root preserving embeddings.

Lemma 10 Let $v$ be a node in $T_1$ with children $v_1, \ldots, v_i$. For a node $w$ in $T_2$, define a sequence $p_1, \ldots, p_i$ by setting $p_1 = \rho(v_1, \max_{\preceq} \mathrm{lr}(w))$ and $p_k = \rho(v_k, p_{k-1})$, for $2 \le k \le i$. There is a root preserving embedding $f$ of $T_1(v)$ in $T_2(w)$ if and only if $\mathrm{label}(v) = \mathrm{label}(w)$ and $p_k \in T_2(w)$, for all $1 \le k \le i$.

Proof. If there is a root preserving embedding between $T_1(v)$ and $T_2(w)$ it is straightforward to check that there is a sequence $p_k$, $1 \le k \le i$, such that the conditions are satisfied. Conversely, assume that $p_k \in T_2(w)$ for all $1 \le k \le i$ and $\mathrm{label}(v) = \mathrm{label}(w)$. We construct a root-preserving embedding $f$ of $T_1(v)$ into $T_2(w)$ as follows. Let $f(v) = w$. By definition of $\rho$ there must be a root preserving embedding $f^k$, $1 \le k \le i$, of $T_1(v_k)$ in $T_2(p_k)$. For a node $u$ in $T_1(v_k)$, $1 \le k \le i$, we set $f(u) = f^k(u)$. Since $p_k \in \mathrm{rr}(p_{k-1})$, $2 \le k \le i$, and $p_k \in T_2(w)$ for all $k$, $1 \le k \le i$, it follows that $f$ is indeed a root-preserving embedding. □

Using dynamic programming it is now straightforward to compute $\rho(v, w)$ for all $v \in V(T_1)$ and $w \in V(T_2)$. For a fixed node $v$ we traverse $T_2$ in reverse postorder. At each node $w \in V(T_2)$ we check if there is a root preserving embedding of $T_1(v)$ in $T_2(w)$. If so we set $\rho(v, q) = w$, for all $q \in \mathrm{lr}(w)$ such that $x \preceq q$, where $x$ is the next root-preserving embedding of $T_1(v)$ in $T_2(w)$.

For a pair of nodes $v \in V(T_1)$ and $w \in V(T_2)$ we test for a root-preserving embedding using Lemma 10. Assuming that values for smaller subproblems have been computed, the time used is $O(\deg(v))$. Hence, the contribution to the total time for the node $w$ is $\sum_{v \in V(T_1)} O(\deg(v)) = O(|T_1|)$. It follows that the time complexity of the algorithm is bounded by $O(|T_1||T_2|)$. Clearly, only $O(|T_1||T_2|)$ space is needed to store $\rho$. Hence, we have the following theorem.

Theorem 4 ([KM95a]) For any pair of rooted, labeled, and ordered trees $T_1$ and $T_2$, the tree inclusion problem can be solved in $O(|T_1||T_2|)$ time and space.

2.5.3 Unordered Tree Inclusion 

In [KM95a] it is shown that the unordered tree inclusion problem is NP-complete. The reduction used is from the Satisfiability problem [GJ79]. Independently, Matoušek and Thomas [MT92] gave another proof of NP-completeness.

An algorithm for the unordered tree inclusion problem is presented in [KM95a] using $O(|T_1| I_1 2^{2I_1} |T_2|)$ time. Hence, if $I_1$ is constant the algorithm runs in $O(|T_1||T_2|)$ time and if $I_1 = \log |T_2|$ the algorithm runs in $O(|T_1| \log |T_2| \, |T_2|^3)$ time.

2.6 Conclusion 

We have surveyed the tree edit distance, alignment distance, and inclusion problems. Furthermore, we have presented, in our opinion, the central algorithms for each of the problems. There are several open problems, which may be the topic of further research. We conclude this paper with a short list proposing some directions.

• For the unordered versions of the above problems some are NP-complete while others are not. Characterizing exactly which types of mappings give NP-complete problems for unordered versions would certainly improve the understanding of all of the above problems.

• The currently best worst case upper bound on the ordered tree edit distance problem is the algorithm of [Kle98] using $O(|T_1|^2|T_2| \log |T_2|)$ time. Conversely, the quadratic lower bound for the longest common subsequence problem [AHU76] is the best general lower bound for the ordered tree edit distance problem. Hence, a large gap in complexity exists which needs to be closed.

• Several meaningful edit operations other than the above may be considered depending on the particular application. Each set of operations yields a new edit distance problem for which we can determine the complexity. Some extensions of the tree edit distance problem have been considered [CRGMW96, CGM97, KTSK00].

Abstract



Given two rooted, ordered, and labeled trees P and T the tree inclusion problem is to determine 
if P can be obtained from T by deleting nodes in T. This problem has recently been recognized as 
an important query primitive in XML databases. Kilpeläinen and Mannila [SIAM J. Comput. 1995]
presented the first polynomial time algorithm using quadratic time and space. Since then several improved 
results have been obtained for special cases when P and T have a small number of leaves or small depth. 
However, in the worst case these algorithms still use quadratic time and space. In this paper we present 
a new approach to the problem which leads to an algorithm using linear space and subquadratic running 
time. Our algorithm improves all previous time and space bounds. Most importantly, the space is 
improved by a linear factor. This will likely make it possible to query larger XML databases and speed 
up the query time since more of the computation can be kept in main memory. 

3.1 Introduction 

Let $T$ be a rooted tree. We say that $T$ is labeled if each node is assigned a character from an alphabet $\Sigma$ and we say that $T$ is ordered if a left-to-right order among siblings in $T$ is given. All trees in this paper are rooted, ordered, and labeled. A tree $P$ is included in $T$, denoted $P \sqsubseteq T$, if $P$ can be obtained from $T$ by deleting nodes of $T$. Deleting a node $v$ in $T$ means making the children of $v$ children of the parent of $v$ and then removing $v$. The children are inserted in the place of $v$ in the left-to-right order among the siblings of $v$. The tree inclusion problem is to determine if $P$ can be included in $T$ and if so report all subtrees of $T$ that include $P$.
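
The delete operation can be sketched as follows in Python. The tuple encoding `(label, children)`, the addressing of a node by its path of child indices from the root, and the name `delete_node` are assumptions made for illustration. The root itself cannot be deleted in this sketch.

```python
def delete_node(tree, path):
    """Delete the node reached by following `path` (child indices) from the root.

    The children of the deleted node are spliced into its position among
    its siblings, preserving the left-to-right order. `path` must be non-empty.
    """
    label, kids = tree
    i = path[0]
    if len(path) == 1:
        # kids[i] is removed and replaced by its own children.
        return (label, kids[:i] + kids[i][1] + kids[i + 1:])
    return (label, kids[:i] + (delete_node(kids[i], path[1:]),) + kids[i + 1:])


def leaf(label):
    return (label, ())


T = ("a", (("d", (leaf("b"), leaf("e"))), leaf("c")))
step1 = delete_node(T, (0,))       # delete d: its children b, e move up
step2 = delete_node(step1, (1,))   # delete the leaf e
P = ("a", (leaf("b"), leaf("c")))
```

After the two deletions the remaining tree equals `P`, so `P` is included in `T` in the sense defined above.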

Recently, the problem has been recognized as an important query primitive for XML data and has received considerable attention, see e.g., [SM02, YLH03, YLH04, ZADR03, SN00, TRS02]. The key idea is that an XML document can be viewed as a tree and queries on the document correspond to a tree inclusion problem. As an example consider Figure 3.1. Suppose that we want to maintain a catalog of books for a bookstore. A fragment of the tree, denoted $D$, corresponding to the catalog is shown in (b). In addition to supporting full-text queries, such as find all documents containing the word "John", we can also utilize the tree structure of the catalog to ask more specific queries, such as "find all books written by John with a chapter that has something to do with XML". We can model this query by constructing the tree, denoted $Q$, shown in (a) and solve the tree inclusion problem: is $Q \sqsubseteq D$? The answer is yes and a possible way to include $Q$ in $D$ is indicated by the dashed lines in (c). If we delete all the nodes in $D$ not touched by dashed lines the trees $Q$ and $D$ become isomorphic. Such a mapping of the nodes from $Q$ to $D$ given by the dashed lines is called an embedding (formally defined in Section 3.3).

The tree inclusion problem was initially introduced by Knuth [Knu69, exercise 2.3.2-22] who gave a sufficient condition for testing inclusion. Motivated by applications in structured databases [KM93, MR90] Kilpeläinen and Mannila [KM95a] presented the first polynomial time algorithm using $O(n_P n_T)$ time and space, where $n_P$ and $n_T$ are the number of nodes in $P$ and $T$, respectively. During the last decade several improvements of the original algorithm of [KM95a] have been suggested [Kil92, AS01, Ric97a, Che98]. The previously best known bound is due to Chen [Che98] who presented an algorithm using $O(l_P n_T)$ time and $O(l_P \cdot \min\{d_T, l_T\})$ space. Here, $l_S$ and $d_S$ denote the number of leaves and the maximum depth of a tree $S$, respectively. This algorithm is based on an algorithm of Kilpeläinen [Kil92]. Note that the time and space is still $\Theta(n_P n_T)$ for worst-case input trees.

*Part of this work was performed while the author was a PhD student at the IT University of Copenhagen.

Figure 3.1: Can the tree (a) be included in the tree (b)? It can and an embedding is given in (c).

In this paper we present three algorithms which combined improve all of the previously known time and
space bounds. To avoid trivial cases we always assume that $1 \le n_P \le n_T$. We show the following theorem:



Theorem 5 For trees P and T the tree inclusion problem can be solved in $O(n_T)$ space with the following
running times:

$$
\min \begin{cases}
O(l_P n_T), \\
O(n_P l_T \log\log n_T + n_T), \\
O\left(\frac{n_P n_T}{\log n_T} + n_T \log n_T\right).
\end{cases}
$$



Hence, when either P or T has few leaves we obtain fast algorithms. When both trees have many leaves
and $n_P = \Omega(\log^2 n_T)$, we instead improve the previous quadratic time bound by a logarithmic factor. Most
importantly, the space used is linear. In the context of XML databases this will likely make it possible to
query larger trees and speed up the query time since more of the computation can be kept in main memory.

3.1.1 Techniques



Most of the previous algorithms, including the best one [Che98], are essentially based on a simple dynamic
programming approach from the original algorithm of [KM95a]. The main idea behind this algorithm is the
following: Let v be a node in P with children $v_1, \ldots, v_i$ and let w be a node in T with children $w_1, \ldots, w_j$.
Consider the subtrees rooted at v and w, denoted by P(v) and T(w). To decide if P(v) can be included in
T(w) we try to find a sequence of numbers $1 \le x_1 < x_2 < \cdots < x_i \le j$ such that $P(v_k)$ can be included
in $T(w_{x_k})$ for all k, $1 \le k \le i$. If we have already determined whether or not $P(v_s) \sqsubseteq T(w_t)$, for all s and
t, $1 \le s \le i$, $1 \le t \le j$, we can efficiently find such a sequence by scanning the children of v from left to
right. Hence, applying this approach in a bottom-up fashion we can determine, if $P(v) \sqsubseteq T(w)$, for all pairs
of nodes v in P and w in T.

In this paper we take a different approach. The main idea is to construct a data structure on T supporting 
a small number of procedures, called the set procedures, on subsets of nodes of T . We show that any such 
data structure implies an algorithm for the tree inclusion problem. We consider various implementations 
of this data structure which all use linear space. The first simple implementation gives an algorithm with 
$O(l_P n_T)$ running time. As it turns out, the running time depends on a well-studied problem known as
the tree color problem. We show a direct connection between a data structure for the tree color problem
and the tree inclusion problem. Plugging in a data structure of Dietz [Die89] we obtain an algorithm with
$O(n_P l_T \log\log n_T + n_T)$ running time.

Based on the simple algorithms above we show how to improve the worst-case running time of the set 
procedures by a logarithmic factor. The general idea used to achieve this is to divide T into small trees called 
clusters of logarithmic size which overlap with other clusters in at most 2 nodes. Each cluster is represented 
by a constant number of nodes in a macro tree. The nodes in the macro tree are then connected according to 
the overlap of the cluster they represent. We show how to efficiently preprocess the clusters and the macro 
tree such that the set procedures use constant time for each cluster. Hence, the worst-case quadratic running 
time is improved by a logarithmic factor. 

Throughout the paper we assume a unit-cost RAM model of computation with word size $\Theta(\log n_T)$ and a
standard instruction set including bitwise boolean operations, shifts, addition, and multiplication. All space 
complexities refer to the number of words used by the algorithm. 

3.1.2 Related Work 

For some applications considering unordered trees is more natural. However, in [MT92, KM95a] this problem
was proved to be NP-complete. The tree inclusion problem is closely related to the tree pattern matching
problem [HO82, Kos89, DGM90, CHI99]. The goal is here to find an injective mapping $f$ from the nodes of P
to the nodes of T such that for every node v in P the $i$th child of v is mapped to the $i$th child of $f(v)$. The
tree pattern matching problem can be solved in $(n_P + n_T) \log^{O(1)}(n_P + n_T)$ time. Another similar problem
is the subtree isomorphism problem [Chu87, ST99], which is to determine if T has a subgraph isomorphic to
P. The subtree isomorphism problem can be solved efficiently for ordered and unordered trees. The best 
algorithms for this problem use 0( "^^J time for unordered trees and Q( ;"g"^ +«t) time for ordered 

trees [Chu87, ST99]. Both use $O(n_P n_T)$ space. The tree inclusion problem can be considered a special case of
the tree edit distance problem [Tai79, ZS89, Kle98, DMRW06]. Here one wants to find the minimum sequence
of insert, delete, and relabel operations needed to transform P into T. Currently the best algorithm for this
problem uses $O(n_T n_P^2 (1 + \log \frac{n_T}{n_P}))$ time [DMRW06]. For more details and references see the survey [Bil05].

3.1.3 Outline 

In Section 3.2 we give notation and definitions used throughout the paper. In Section 3.3 a common framework
for our tree inclusion algorithms is given. Section 3.4 presents two simple algorithms and then, based
on these results, we show how to get a faster algorithm in Section 3.5.

Figure 3.2: In (a) we have $\Phi(S_1, S_2, S_1, S_3, S_4) = \{(v_1, v_2, v_3, v_6, v_7), (v_1, v_2, v_5, v_6, v_7), (v_1, v_4, v_5, v_6, v_7), (v_3, v_4, v_5, v_6, v_7)\}$
and thus $\mathrm{mop}(S_1, S_2, S_1, S_3, S_4) = \{(v_3, v_7)\}$. In (b) we have $\Phi(S_1, S_2, S_1, S_3, S_4) =
\{(v_1, v_2, v_3, v_5, v_7), (v_1, v_2, v_6, v_8, v_9), (v_1, v_2, v_3, v_8, v_9), (v_1, v_2, v_3, v_5, v_9), (v_1, v_4, v_6, v_8, v_9), (v_3, v_4, v_6, v_8, v_9)\}$
and thus $\mathrm{mop}(S_1, S_2, S_1, S_3, S_4) = \{(v_1, v_7), (v_3, v_9)\}$.



3.2 Notation and Definitions 

In this section we define the notation and definitions we will use throughout the paper. For a graph G we
denote the set of nodes and edges by V(G) and E(G), respectively. Let T be a rooted tree. The root of T
is denoted by root(T). The size of T, denoted by $n_T$, is $|V(T)|$. The depth of a node $v \in V(T)$, depth(v), is
the number of edges on the path from v to root(T) and the depth of T, denoted $d_T$, is the maximum depth
of any node in T. The parent of v is denoted parent(v) and the set of children of v is denoted child(v). A
node with no children is a leaf and otherwise an internal node. The set of leaves of T is denoted L(T) and
we define $l_T = |L(T)|$. We say that T is labeled if each node v is assigned a character, denoted label(v),
from an alphabet $\Sigma$ and we say that T is ordered if a left-to-right order among siblings in T is given. All
trees in this paper are rooted, ordered, and labeled.

Ancestors and Descendants Let T(v) denote the subtree of T rooted at a node $v \in V(T)$. If $w \in V(T(v))$
then v is an ancestor of w, denoted $v \preceq w$, and if $w \in V(T(v)) \setminus \{v\}$ then v is a proper ancestor of w, denoted
$v \prec w$. If v is a (proper) ancestor of w then w is a (proper) descendant of v. A node z is a common ancestor
of v and w if it is an ancestor of both v and w. The nearest common ancestor of v and w, nca(v, w), is the
common ancestor of v and w of greatest depth. The first ancestor of w labeled $\alpha$, denoted $\mathrm{fl}(w, \alpha)$, is the
node v such that $v \preceq w$, label(v) = $\alpha$, and no node on the path between v and w is labeled $\alpha$. If no such
node exists then $\mathrm{fl}(w, \alpha) = \bot$, where $\bot \notin V(T)$ is a special null node.

Traversals and Orderings Let T be a tree with root v and let $v_1, \ldots, v_k$ be the children of v from left-to-right.
The preorder traversal of T is obtained by visiting v and then recursively visiting $T(v_i)$, $1 \le i \le k$,
in order. Similarly, the postorder traversal is obtained by first visiting $T(v_i)$, $1 \le i \le k$, in order and then
v. The preorder number and postorder number of a node $w \in T(v)$, denoted by pre(w) and post(w), is the
number of nodes preceding w in the preorder and postorder traversal of T, respectively. The nodes to the
left of w in T is the set of nodes $u \in V(T)$ such that pre(u) < pre(w) and post(u) < post(w). If u is to the
left of w, denoted by $u \lhd w$, then w is to the right of u. If $u \lhd w$, $u \preceq w$, or $w \prec u$ we write $u \unlhd w$. The null
node $\bot$ is not in the ordering, i.e., $\bot \ntrianglelefteq v$ for all nodes v.
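
The numbering and the resulting order tests can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: the tree is given as a dictionary `children` mapping a node to its left-to-right list of children, and the helper names `number_tree`, `is_ancestor`, `is_left_of`, and `left_or_related` are hypothetical. The recursion is fine for small examples; a large tree would need an explicit stack.

```python
def number_tree(children, root):
    """Return (pre, post): for each node, the number of nodes preceding it
    in the preorder and postorder traversal, respectively."""
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre)              # v is visited before its subtrees
        for c in children.get(v, []):  # children in left-to-right order
            visit(c)
        post[v] = len(post)            # v is visited after its subtrees

    visit(root)
    return pre, post


def is_ancestor(pre, post, v, w):
    """True if v is an ancestor of w (every node is an ancestor of itself)."""
    return pre[v] <= pre[w] and post[w] <= post[v]


def is_left_of(pre, post, u, w):
    """True if u is to the left of w."""
    return pre[u] < pre[w] and post[u] < post[w]


def left_or_related(pre, post, u, w):
    """The weak order: u is left of w, or one is an ancestor of the other.
    Equivalently, w is not strictly to the left of u."""
    return not is_left_of(pre, post, w, u)


# Hypothetical example tree: node 0 has children 1 and 4, node 1 has children 2 and 3.
children = {0: [1, 4], 1: [2, 3]}
pre, post = number_tree(children, 0)
```

With these numbers every ancestor and left-of test takes constant time, which is what the later list-based procedures rely on.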

Minimum Ordered Pairs A set of nodes $X \subseteq V(T)$ is deep if no node in X is a proper ancestor of
another node in X. For k deep sets of nodes $X_1, \ldots, X_k$ let $\Phi(X_1, \ldots, X_k) \subseteq (X_1 \times \cdots \times X_k)$ be the set of
tuples such that $(x_1, \ldots, x_k) \in \Phi(X_1, \ldots, X_k)$ iff $x_1 \lhd \cdots \lhd x_k$. If $(x_1, \ldots, x_k) \in \Phi(X_1, \ldots, X_k)$ and there is
no $(x'_1, \ldots, x'_k) \in \Phi(X_1, \ldots, X_k)$, where either $x_1 \lhd x'_1 \lhd x'_k \unlhd x_k$ or $x_1 \unlhd x'_1 \lhd x'_k \lhd x_k$, then the pair $(x_1, x_k)$ is
a minimum ordered pair. The set of minimum ordered pairs for $X_1, \ldots, X_k$ is denoted by $\mathrm{mop}(X_1, \ldots, X_k)$.
Figure 3.2 illustrates these concepts on a small example. For any set of pairs Y, let $Y|_1$ and $Y|_2$ denote the
projection of Y to the first and second coordinate, that is, if $(y_1, y_2) \in Y$ then $y_1 \in Y|_1$ and $y_2 \in Y|_2$. We say
that Y is deep if $Y|_1$ and $Y|_2$ are deep. The following lemma shows that given deep sets $X_1, \ldots, X_k$ we can
compute $\mathrm{mop}(X_1, \ldots, X_k)$ iteratively by first computing $\mathrm{mop}(X_1, X_2)$ and then $\mathrm{mop}(\mathrm{mop}(X_1, X_2)|_2, X_3)$
and so on.
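
To make the definition concrete, the following Python sketch computes the tuple set and the minimum ordered pairs by brute force, directly from the definition. It is an illustration only and is exponential in k. The function names `phi` and `mop` and the parameter `left_of` are assumptions of the sketch; the example uses integers as stand-ins for sibling leaves, so that "to the left of" is simply `<`.

```python
from itertools import product


def phi(sets, left_of):
    """All tuples (x_1, ..., x_k), one node per set, ordered strictly left to right."""
    return [t for t in product(*sets)
            if all(left_of(a, b) for a, b in zip(t, t[1:]))]


def mop(sets, left_of):
    """Minimum ordered pairs, computed by brute force from the definition."""
    tuples = phi(sets, left_of)

    def left_eq(a, b):
        # Weak order: b is not strictly to the left of a.
        return not left_of(b, a)

    result = set()
    for t in tuples:
        x1, xk = t[0], t[-1]
        dominated = any(
            (left_of(x1, s[0]) and left_of(s[0], s[-1]) and left_eq(s[-1], xk)) or
            (left_eq(x1, s[0]) and left_of(s[0], s[-1]) and left_of(s[-1], xk))
            for s in tuples)
        if not dominated:
            result.add((x1, xk))
    return result


# Hypothetical example: six sibling leaves 1..6, so "left of" is just "<".
lt = lambda a, b: a < b
X1, X2, X3 = {1, 4}, {2, 5}, {3, 6}
direct = mop([X1, X2, X3], lt)

# The iterative computation: first mop(X1, X2), then mop(mop(X1, X2)|_2, X3).
step = mop([X1, X2], lt)
last = mop([{b for _, b in step}, X3], lt)
iterative = {(a, c) for a, b in step for b2, c in last if b == b2}
```

On this example both `direct` and `iterative` contain the same two pairs, which is the behavior the lemma below states in general.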

Lemma 11 For any deep sets of nodes $X_1, \ldots, X_k$ we have, $(x_1, x_k) \in \mathrm{mop}(X_1, \ldots, X_k)$ iff there exists a
node $x_{k-1}$ such that $(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$ and $(x_{k-1}, x_k) \in \mathrm{mop}(\mathrm{mop}(X_1, \ldots, X_{k-1})|_2, X_k)$.

Proof. We start by showing that if $(x_1, x_k) \in \mathrm{mop}(X_1, \ldots, X_k)$ then there exists a node $x_{k-1}$ such that
$(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$ and $(x_{k-1}, x_k) \in \mathrm{mop}(\mathrm{mop}(X_1, \ldots, X_{k-1})|_2, X_k)$.

First note that $(z_1, \ldots, z_k) \in \Phi(X_1, \ldots, X_k)$ implies $(z_1, \ldots, z_{k-1}) \in \Phi(X_1, \ldots, X_{k-1})$. Since $(x_1, x_k) \in
\mathrm{mop}(X_1, \ldots, X_k)$ there must be a minimum $x_{k-1}$ such that the tuple $(x_1, \ldots, x_{k-1})$ is in $\Phi(X_1, \ldots, X_{k-1})$.
We have $(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$. We need to show $(x_{k-1}, x_k) \in \mathrm{mop}(\mathrm{mop}(X_1, \ldots, X_{k-1})|_2, X_k)$.
Since $(x_1, x_k) \in \mathrm{mop}(X_1, \ldots, X_k)$ there exists no $z \in X_k$ such that $x_{k-1} \lhd z \lhd x_k$. Assume there exists a
$z \in \mathrm{mop}(X_1, \ldots, X_{k-1})|_2$ such that $x_{k-1} \lhd z \lhd x_k$. Since $(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$ this implies that
there is a $z'$ with $x_1 \lhd z'$ such that $(z', z) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$. But this implies that the tuple $(z', \ldots, z, x_k)$ is in
$\Phi(X_1, \ldots, X_k)$ contradicting that $(x_1, x_k) \in \mathrm{mop}(X_1, \ldots, X_k)$.

We will now show that if there exists a $x_{k-1}$ such that $(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$ and $(x_{k-1}, x_k) \in
\mathrm{mop}(\mathrm{mop}(X_1, \ldots, X_{k-1})|_2, X_k)$ then the pair $(x_1, x_k) \in \mathrm{mop}(X_1, \ldots, X_k)$. Clearly, there exists a tuple
$(x_1, \ldots, x_{k-1}, x_k) \in \Phi(X_1, \ldots, X_k)$. Assume that there exists a tuple $(z_1, \ldots, z_k) \in \Phi(X_1, \ldots, X_k)$ such that
$x_1 \lhd z_1 \lhd z_k \unlhd x_k$. Since $z_{k-1} \unlhd x_{k-1}$ this contradicts that $(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$. Assume that there
exists a tuple $(z_1, \ldots, z_k) \in \Phi(X_1, \ldots, X_k)$ such that $x_1 \unlhd z_1 \lhd z_k \lhd x_k$. Since $(x_1, x_{k-1}) \in \mathrm{mop}(X_1, \ldots, X_{k-1})$
we have $x_{k-1} \unlhd z_{k-1}$ and thus $x_{k-1} \lhd z_k$ contradicting $(x_{k-1}, x_k) \in \mathrm{mop}(\mathrm{mop}(X_1, \ldots, X_{k-1})|_2, X_k)$. □

When we want to specify which tree we mean in the above relations we add a subscript. For instance,
$v \preceq_T w$ indicates that v is an ancestor of w in T.

3.3 Computing Deep Embeddings 

In this section we present a general framework for answering tree inclusion queries. As in [KM95a] we solve
the equivalent tree embedding problem. Let P and T be rooted labeled trees. An embedding of P in T is an
injective function $f : V(P) \to V(T)$ such that for all nodes $v, u \in V(P)$,

(i) label(v) = label(f(v)). (label preservation condition)

(ii) $v \prec u$ iff $f(v) \prec f(u)$. (ancestor condition)

(iii) $v \lhd u$ iff $f(v) \lhd f(u)$. (order condition)

An example of an embedding is given in Figure 3.1(c). 

Lemma 12 (Kilpeläinen and Mannila [KM95a]) For any trees P and T, $P \sqsubseteq T$ iff there exists an
embedding of P in T.

We say that the embedding f is deep if there is no embedding g such that $f(\mathrm{root}(P)) \prec g(\mathrm{root}(P))$. The
deep occurrences of P in T, denoted emb(P, T), is the set of nodes,

$\mathrm{emb}(P, T) = \{f(\mathrm{root}(P)) \mid f \text{ is a deep embedding of } P \text{ in } T\}$.

By definition the set of ancestors of nodes in emb(P, T) is exactly the set of nodes $\{u \mid P \sqsubseteq T(u)\}$. Hence,
to solve the tree inclusion problem it is sufficient to compute emb(P, T) and then, using additional $O(n_T)$
time, report all ancestors of this set. Note that the set emb(P, T) is deep.

In the following we show how to compute deep embeddings. The key idea is to build a data structure for
T allowing a fast implementation of the following procedures. For all $X \subseteq V(T)$, $Y \subseteq V(T) \times V(T)$, and
$\alpha \in \Sigma$ define:

Parent(X): Return the set $\{\mathrm{parent}(x) \mid x \in X\}$.

Nca(Y): Return the set $\{\mathrm{nca}(y_1, y_2) \mid (y_1, y_2) \in Y\}$.

Deep(X): Return the set $\{x \in X \mid \text{there is no } z \in X \text{ such that } x \prec z\}$.

Mop(Y, X): Return the set of pairs R such that for any pair $(y_1, y_2) \in Y$, $(y_1, x) \in R$ iff $(y_2, x) \in
\mathrm{mop}(Y|_2, X)$.

Fl(X, $\alpha$): Return the set $\{\mathrm{fl}(x, \alpha) \mid x \in X\}$.

Collectively we call these procedures the set procedures. The procedures Parent, Nca, and Fl are self-explanatory.
Deep(X) returns the set of all nodes in X that have no descendants in X. Hence, the returned
set is always deep. Mop is used to iteratively compute minimum ordered pairs. If we want to specify that a
procedure applies to a certain tree T we add the subscript T. With the set procedures we can compute deep
embeddings. The following procedure Emb(v), $v \in V(P)$, recursively computes the set of deep occurrences
of P(v) in T. Figure 3.3 illustrates how Emb works on a small example.

Emb(v): Let $v_1, \ldots, v_k$ be the sequence of children of v ordered from left to right. There are three cases:

1. k = 0 (v is a leaf). Compute R := Deep(Fl(L(T), label(v))).

2. k = 1. Recursively compute $R_1$ := Emb($v_1$).
Compute R := Deep(Fl(Deep(Parent($R_1$)), label(v))).

3. k > 1. Compute $R_1$ := Emb($v_1$) and set $U_1 := \{(r, r) \mid r \in R_1\}$.
For i := 2 to k, compute $R_i$ := Emb($v_i$) and $U_i$ := Mop($U_{i-1}$, $R_i$).
Finally, compute R := Deep(Fl(Deep(Nca($U_k$)), label(v))).

If $R = \emptyset$ stop and report that there is no deep embedding of P(v) in T. Otherwise return R.
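
The recursion can be sketched in Python with naive, set-based versions of the set procedures. This is a sketch under stated assumptions, not the efficient list implementation developed in Section 3.4: the class `Tree` and the helpers `deep`, `fl`, `parent_set`, `nca_set`, `mop_pairs`, and `emb` are hypothetical names, trees are given as label and parent arrays, and sibling order is assumed to follow node indices. An empty result plays the role of "no deep embedding".

```python
class Tree:
    def __init__(self, labels, parent):
        # parent[v] is the parent index of v (None for the root). Assumption:
        # siblings are ordered left to right by increasing node index.
        self.labels, self.parent = labels, parent
        self.children = [[] for _ in labels]
        for v, p in enumerate(parent):
            if p is not None:
                self.children[p].append(v)
        self.pre, self.post = {}, {}
        self._number(parent.index(None))

    def _number(self, v):
        self.pre[v] = len(self.pre)
        for c in self.children[v]:
            self._number(c)
        self.post[v] = len(self.post)

    def anc(self, v, w):   # v is an ancestor of w (or v == w)
        return self.pre[v] <= self.pre[w] and self.post[w] <= self.post[v]

    def left(self, u, w):  # u is to the left of w
        return self.pre[u] < self.pre[w] and self.post[u] < self.post[w]


def deep(T, X):
    return {x for x in X if not any(x != z and T.anc(x, z) for z in X)}


def fl(T, X, a):
    out = set()
    for x in X:
        while x is not None and T.labels[x] != a:
            x = T.parent[x]          # walk up to the first node labeled a
        if x is not None:
            out.add(x)
    return out


def parent_set(T, X):
    return {T.parent[x] for x in X if T.parent[x] is not None}


def nca_set(T, Y):
    out = set()
    for u, w in Y:
        while not T.anc(u, w):
            u = T.parent[u]
        out.add(u)
    return out


def mop_pairs(T, Y, X):
    # Brute-force minimum ordered pairs of (Y|_2, X), reported with Y's first coordinate.
    cand = [(y1, y2, x) for (y1, y2) in Y for x in X if T.left(y2, x)]

    def dominated(y2, x):
        return any((T.left(y2, b) and not T.left(x, c)) or
                   (not T.left(b, y2) and T.left(c, x)) for (_, b, c) in cand)

    return {(y1, x) for (y1, y2, x) in cand if not dominated(y2, x)}


def emb(P, T, v):
    kids, a = P.children[v], P.labels[v]
    if not kids:                                          # case 1
        leaves = {x for x in range(len(T.labels)) if not T.children[x]}
        return deep(T, fl(T, leaves, a))
    R1 = emb(P, T, kids[0])
    if len(kids) == 1:                                    # case 2
        return deep(T, fl(T, deep(T, parent_set(T, R1)), a))
    U = {(r, r) for r in R1}                              # case 3
    for c in kids[1:]:
        U = mop_pairs(T, U, emb(P, T, c))
    return deep(T, fl(T, deep(T, nca_set(T, U)), a))


# Hypothetical example: T = a(x(b), y(c)) and P = a(b, c).
T = Tree(['a', 'x', 'b', 'y', 'c'], [None, 0, 1, 0, 3])
P = Tree(['a', 'b', 'c'], [None, 0, 0])
occurrences = emb(P, T, 0)
```

Here `occurrences` contains only the root of T, since deleting the nodes labeled x and y turns T into P. Swapping the two children of P yields the empty set, because the order condition can no longer be met.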

Lemma 13 For trees P and T and node $v \in V(P)$, Emb(v) computes the set of deep occurrences of P(v)
in T.

Proof. By induction on the size of the subtree P(v). If v is a leaf we immediately have that emb(v, T) =
Deep(Fl(L(T), label(v))) and thus case 1 follows. Suppose that v is an internal node with $k \ge 1$ children
$v_1, \ldots, v_k$. We show that emb(P(v), T) = Emb(v). Consider cases 2 and 3 of the algorithm.

If k = 1 we have that $w \in$ Emb(v) implies that label(w) = label(v) and there is a node $w_1 \in$ Emb($v_1$)
such that fl(parent($w_1$), label(v)) = w, that is, no node on the path between $w_1$ and w is labeled label(v).
By induction Emb($v_1$) = emb(P($v_1$), T) and therefore w is the root of an embedding of P(v) in T. Since
Emb(v) is the deep set of all such nodes it follows that $w \in$ emb(P(v), T). Conversely, if $w \in$ emb(P(v), T)
then label(w) = label(v), there is a node $w_1 \in$ emb(P($v_1$), T) such that $w \prec w_1$, and no node on the path
between w and $w_1$ is labeled label(v), that is, fl($w_1$, label(v)) = w. Hence, $w \in$ Emb(v).

Before considering case 3 we first show that $U_j = \mathrm{mop}(\textsc{Emb}(v_1), \ldots, \textsc{Emb}(v_j))$ by induction on j, $2 \le j \le k$.
For j = 2 it follows from the definition of Mop that $U_2 = \mathrm{mop}(\textsc{Emb}(v_1), \textsc{Emb}(v_2))$. Hence, assume that
j > 2. We have $U_j = \textsc{Mop}(U_{j-1}, \textsc{Emb}(v_j)) = \textsc{Mop}(\mathrm{mop}(\textsc{Emb}(v_1), \ldots, \textsc{Emb}(v_{j-1})), R_j)$. By definition of
Mop, $U_j$ is the set of pairs such that for any pair $(r_1, r_{j-1}) \in \mathrm{mop}(\textsc{Emb}(v_1), \ldots, \textsc{Emb}(v_{j-1}))$, $(r_1, r_j) \in U_j$
iff $(r_{j-1}, r_j) \in \mathrm{mop}(\mathrm{mop}(\textsc{Emb}(v_1), \ldots, \textsc{Emb}(v_{j-1}))|_2, R_j)$. By Lemma 11 it follows that $(r_1, r_j) \in U_j$ iff
$(r_1, r_j) \in \mathrm{mop}(\textsc{Emb}(v_1), \ldots, \textsc{Emb}(v_j))$.

Next consider the case when k > 1. If $w \in$ Emb(v) we have that label(w) = label(v) and there are
nodes $(w_1, w_k) \in \mathrm{mop}(\mathrm{emb}(P(v_1), T), \ldots, \mathrm{emb}(P(v_k), T))$ such that $w = \mathrm{fl}(\mathrm{nca}(w_1, w_k), \mathrm{label}(v))$. Clearly,
w is the root of an embedding of P(v) in T. Assume for contradiction that w is not a deep embedding,
that is, $w \prec u$ for some node $u \in$ emb(P(v), T). Since $w = \mathrm{fl}(\mathrm{nca}(w_1, w_k), \mathrm{label}(v))$ there must be nodes
$u_1 \lhd \cdots \lhd u_k$, such that $u_i \in \mathrm{emb}(P(v_i), T)$ and $u = \mathrm{fl}(\mathrm{nca}(u_1, u_k), \mathrm{label}(v))$. However, this contradicts
the fact that $(w_1, w_k) \in \mathrm{mop}(\mathrm{emb}(P(v_1), T), \ldots, \mathrm{emb}(P(v_k), T))$. If $w \in$ emb(P(v), T) a similar argument
implies that $w \in$ Emb(v). □

Figure 3.3: Computing the deep occurrences of P into T depicted in (a) and (b) respectively. The nodes in
P are numbered 1-4 for easy reference. (c) Case 1 of Emb: The set Emb(3). Since 3 and 4 are leaves and
label(3) = label(4) we have Emb(3) = Emb(4). (d) Case 2 of Emb: The set Emb(2). Note that the middle
child of the root of T is not in the set since it is not a deep occurrence. (e) Case 3 of Emb: The two minimal
ordered pairs of (d) and (c). (f) The nearest common ancestors of both pairs in (e) give the root node of T
which is the only (deep) occurrence of P.



The set L(T) is deep and in all three cases of Emb(v) the returned set is also deep. By induction it follows
that the input to Parent, Fl, Nca, and Mop is always deep. We will use this fact to our advantage in the
following algorithms.

3.4 A Simple Tree Inclusion Algorithm 

In this section we present a simple implementation of the set procedures which leads to an efficient tree
inclusion algorithm. Subsequently, we modify one of the procedures to obtain a family of tree inclusion 
algorithms where the complexities depend on the solution to a well-studied problem known as the tree color 
problem. 

3.4.1 Preprocessing 

To compute deep embeddings we require a data structure for T which allows us, for any $v, w \in V(T)$, to
compute $\mathrm{nca}_T(v, w)$ and determine if $v \prec w$ or $v \lhd w$. In linear time we can compute pre(v) and post(v) for
all nodes $v \in V(T)$, and with these it is straightforward to test the two conditions. Furthermore,

Lemma 14 (Harel and Tarjan [HT84]) For any tree T there is a data structure using $O(n_T)$ space and
preprocessing time which supports nearest common ancestor queries in $O(1)$ time.

Hence, our data structure uses linear preprocessing time and space (see also [BFC00, AGKR04] for more
recent nearest common ancestor data structures).

3.4.2 Implementation of the Set Procedures 

To answer tree inclusion queries we give an efficient implementation of the set procedures. The idea is to
represent sets of nodes and sets of pairs of nodes in a left-to-right order using linked lists. For this purpose
we introduce some helpful notation. Let $X = [x_1, \ldots, x_k]$ be a linked list of nodes. The length of X, denoted
$|X|$, is the number of elements in X and the list with no elements is written []. The ith node of X, denoted
X[i], is $x_i$. Given any node y the list obtained by appending y to X, is the list $X \circ y = [x_1, \ldots, x_k, y]$. If for
all i, $1 \le i \le |X| - 1$, $X[i] \lhd X[i + 1]$ then X is ordered and if $X[i] \unlhd X[i + 1]$ then X is semiordered. A list
$Y = [(x_1, z_1), \ldots, (x_k, z_k)]$ is a node pair list. By analogy, we define length, append, etc. for Y. For a pair
$Y[i] = (x_i, z_i)$ define $Y[i]_1 = x_i$ and $Y[i]_2 = z_i$. If the lists $[Y[1]_1, \ldots, Y[k]_1]$ and $[Y[1]_2, \ldots, Y[k]_2]$ are both
ordered or semiordered then Y is ordered or semiordered, respectively.

The set procedures are implemented using node lists. All lists used in the procedures are either ordered
or semiordered. As noted in Section 3.3 we may assume that the input to all of the procedures, except Deep,
represents a deep set, that is, the corresponding node list or node pair list is ordered. We assume that the
input list given to Deep is semiordered and the output, of course, is ordered. Hence, the output of all the
other set procedures must be semiordered. In the following let X be a node list, Y a node pair list, and $\alpha$ a
character in $\Sigma$. The detailed implementation of the set procedures is given below. We show the correctness
in Section 3.4.3 and discuss the complexity in Section 3.4.4.

Parent(X): Return the list $[\mathrm{parent}(X[1]), \ldots, \mathrm{parent}(X[|X|])]$.

Nca(Y): Return the list $[\mathrm{nca}(Y[1]), \ldots, \mathrm{nca}(Y[|Y|])]$.

Deep(X): Initially, set x := X[1] and R := [].
For i := 2 to |X| do:
Compare x and X[i]. There are three cases:

1. $x \lhd X[i]$. Set $R := R \circ x$ and x := X[i].

2. $x \preceq X[i]$. Set x := X[i].

3. $X[i] \prec x$. Do nothing.

Return $R \circ x$.

Figure 3.4: Case 1 and 2 from the implementation of Mop. In (a) we have $Y[i]_2 \ntriangleleft x$. In (b) we have
$Y[j]_2 \lhd Y[i]_2 \lhd x = X[h]$.

The implementation of procedure Deep takes advantage of the fact that the input list is semiordered. In
case 1 node X[i] is to the right of our "potential output node" x. Since any node that is a descendant of x
must be to the left of X[i] it cannot appear later in the list X than X[i]. We can thus safely add x to
R at this point. In case 2 node x is an ancestor of X[i] and can thus not be in the output list. In case 3
node X[i] is an ancestor of x and can thus not be in the output list.
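
The single pass of Deep can be sketched in Python as follows. The sketch uses a plain Python list instead of a linked list and assumes that preorder and postorder numbers are available as dictionaries `pre` and `post`; the function name `deep` and the example tree are assumptions made for illustration.

```python
def deep(X, pre, post):
    """Deep(X) for a semiordered list X, computed in one left-to-right pass."""
    if not X:
        return []
    x, R = X[0], []                 # x is the potential output node
    for z in X[1:]:
        if pre[x] < pre[z] and post[x] < post[z]:
            R.append(x)             # case 1: z is to the right of x, so x is final
            x = z
        elif pre[x] <= pre[z] and post[z] <= post[x]:
            x = z                   # case 2: x is an ancestor of z (or equal to it)
        # case 3: z is a proper ancestor of x, so z is skipped
    R.append(x)
    return R


# Preorder and postorder numbers of a hypothetical tree in which node 0 has
# children 1 and 4, and node 1 has children 2 and 3.
pre = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
post = {2: 0, 3: 1, 1: 2, 4: 3, 0: 4}
leaves_only = deep([0, 1, 2, 3, 4], pre, post)
```

For the semiordered input `[0, 1, 2, 3, 4]` the pass drops the two internal nodes and keeps the three leaves in left-to-right order.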

Mop(Y, X): Initially, set R := [].

Find the smallest j such that $Y[1]_2 \lhd X[j]$ and set $y := Y[1]_1$, $x := X[j]$, and h := j. If
no such j exists stop.

For i := 2 to |Y| do:

Set h := h + 1 until $Y[i]_2 \lhd X[h]$ or h > |X|.

If h > |X| stop and return $R := R \circ (y, x)$. Otherwise, compare X[h] and x. There
are two cases:

1. If $x \lhd X[h]$ set $R := R \circ (y, x)$, $y := Y[i]_1$, and x := X[h].

2. If x = X[h] set $y := Y[i]_1$.

Return $R := R \circ (y, x)$.

In procedure Mop we have a "potential pair" (y, x) where $y = Y[i]_1$ for some i and $Y[i]_2 \lhd x$. Let j be the
index such that $y = Y[j]_1$. In case 1 we have $x \lhd X[h]$ and also $Y[j]_2 \lhd Y[i]_2$ since the input lists are ordered
(see Figure 3.4(a)). Therefore, (y, x) is inserted into R. In case 2 we have x = X[h], i.e., $Y[i]_2 \lhd x$, and as
before $Y[j]_2 \lhd Y[i]_2$ (see Figure 3.4(b)). Therefore (y, x) cannot be in the output, and we set $(Y[i]_1, x)$ to
be the new potential pair.
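
A Python sketch of this scan is given below, with Python lists in place of linked lists and zero-based indices. The function name `mop` and the parameter `left_of` are assumptions of the sketch. Both input lists are assumed to be ordered, so inside the loop X[h] is either the current x or a node to the right of it, and the test `X[h] != x` selects case 1.

```python
def mop(Y, X, left_of):
    """Mop(Y, X) for an ordered node pair list Y and an ordered node list X."""
    if not Y:
        return []
    h = 0
    while h < len(X) and not left_of(Y[0][1], X[h]):
        h += 1
    if h == len(X):
        return []                      # no node of X is to the right of Y[0]'s second node
    R = []
    y, x = Y[0][0], X[h]               # the potential pair
    for y1, y2 in Y[1:]:
        while h < len(X) and not left_of(y2, X[h]):
            h += 1                     # advance until y2 is to the left of X[h]
        if h == len(X):
            break                      # the remaining pairs of Y have no partner in X
        if X[h] != x:                  # case 1: the potential pair is final
            R.append((y, x))
            x = X[h]
        y = y1                         # cases 1 and 2: Y[i] gives the new first coordinate
    R.append((y, x))
    return R


# Hypothetical example on sibling leaves, where "left of" is just "<".
lt = lambda a, b: a < b
pairs = mop([('p', 1), ('q', 2), ('r', 5)], [3, 4, 6], lt)
```

In the example the pair starting at `'p'` is replaced by the one starting at `'q'` (case 2), since both have 3 as their nearest node to the right, and the node 4 is never used.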

Fl(X, $\alpha$): Initially, set Z := X, R := [], and S := [].

Repeat until Z = []:

For i := 1 to |Z| do: If label(Z[i]) = $\alpha$ set R := Insert(Z[i], R). Otherwise set
$S := S \circ \mathrm{parent}(Z[i])$.

Set S := Deep(S), Z := Deep*(S, R), and S := [].

Return R.

The procedure Fl calls two auxiliary procedures: Insert(x, R) that takes an ordered list R and inserts the
node x such that the resulting list is ordered, and Deep*(S, R) that takes two ordered lists and returns the
ordered list representing the set $\textsc{Deep}(S \cup R) \cap S$, i.e., $\textsc{Deep}^*(S, R) = [s \in S \mid \nexists z \in R : s \prec z]$. Below we
describe in more detail how to implement Fl together with the auxiliary procedures.

We use one doubly linked list to represent all the lists Z, S, and R. For each element in Z we have
pointers Pred and Succ pointing to the predecessor and successor in the list, respectively. We also have
at each element a pointer Next pointing to the next element in Z. In the beginning Next = Succ for all
elements, since all elements in the list are in Z. When going through Z in one iteration we simply follow
the Next pointers. When Fl calls Insert(Z[i], R) we set Next(Pred(Z[i])) to Next(Z[i]). That is, all nodes
in the list not in Z, i.e., nodes not having a Next pointer pointing to them, are in R. We do not explicitly
maintain S. Instead we just save parent(Z[i]) at the position in the list instead of Z[i]. Now Deep(S)
can be performed following the Next pointers and removing elements from the doubly linked list according
to procedure Deep. It remains to show how to calculate Deep*(S, R). This can be done by running through
S following the Next pointers. At each node s compare Pred(s) and Succ(s) with s. If one of them is a
descendant of s remove s from the doubly linked list.

Using this linked list implementation Deep*(S, R) takes time $O(|S|)$, whereas using Deep to calculate
this would have used time $O(|S| + |R|)$.
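
The iteration structure of Fl can be sketched in Python with ordinary sets. This deliberately ignores the doubly linked list and its time bounds, and only shows how Z, S, and R evolve from one round to the next. The names `make_is_ancestor` and `fl_set`, the array representation of the tree, and the example are assumptions of the sketch.

```python
def make_is_ancestor(parent):
    """Ancestor test by walking parent pointers (slow, but self-contained)."""
    def is_ancestor(v, w):
        while w is not None:
            if w == v:
                return True
            w = parent[w]
        return False
    return is_ancestor


def fl_set(X, alpha, parent, label, is_ancestor):
    """Set-based sketch of Fl(X, alpha): nodes labeled alpha are collected in R,
    all other nodes move up to their parents, one level per round."""
    Z, R = list(X), set()
    while Z:
        S = set()
        for z in Z:
            if label[z] == alpha:
                R.add(z)
            elif parent[z] is not None:
                S.add(parent[z])
        # Deep(S): drop nodes with a proper descendant in S.
        S = {s for s in S if not any(s != t and is_ancestor(s, t) for t in S)}
        # Deep*(S, R): drop nodes with a proper descendant in R.
        Z = [s for s in S if not any(s != r and is_ancestor(s, r) for r in R)]
    return R


# Hypothetical tree: node 0 ('a') has children 1 ('a') and 3 ('b'); node 1 has child 2 ('b').
parent = [None, 0, 1, 0]
label = ['a', 'a', 'b', 'b']
anc = make_is_ancestor(parent)
result = fl_set([2, 3], 'a', parent, label, anc)
```

In the example the first ancestors labeled `'a'` of nodes 2 and 3 are nodes 1 and 0. The root is discarded in the first round because it has the proper descendant 1 in S, so `result` contains only node 1. Dropping such non-deep answers is harmless, since every call to Fl in Emb is wrapped in Deep.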

3.4.3 Correctness of the Set Procedures 

Clearly, Parent and Nca are correct. The following lemmas show that Deep, Fl, and Mop are also
correctly implemented. For notational convenience we write $x \in X$, for a list X, if x = X[i] for some i,
$1 \le i \le |X|$.

Lemma 15 Procedure Deep(X) is correct. 

Proof. Let y be an element in X. We will first prove that if there are no descendants of y in X, i.e.,
$X \cap V(T(y)) = \emptyset$, then $y \in R$. Since $X \cap V(T(y)) = \emptyset$ we must at some point during the procedure have
x = y, and x will not change before x is added to R. If y occurs several times in X we will have x = y each
time we meet a copy of y (except the first) and it follows from the implementation that y will occur exactly
once in R.

We will now prove that if there are any descendants of y in X, i.e., $X \cap V(T(y)) \ne \emptyset$, then $y \notin R$. Let z
be the rightmost and deepest descendant of y in X. There are two cases:

1. y is before z in X. Look at the time in the execution of the procedure when we look at z. There are
two cases.

(a) x = y. Since $y \prec z$ we set x = z and proceed. It follows that $y \notin R$.

(b) $x = x' \ne y$. Since any node to the left of y also is to the left of z and X is a semiordered list we
must have $x' \in V(T(y))$ and thus $y \notin R$.

2. y is after z in X. Since z is the rightmost and deepest descendant of y and X is semiordered we must
have x = z at the time in the procedure where we look at y. Therefore $y \notin R$.

If y occurs several times in X, each copy will be taken care of by either case 1 or 2. □

Lemma 16 Procedure Mop(Y, X) is correct.

Proof. We want to show that for any $1 \le l \le |Y|$, $1 \le k \le |X|$ the pair $(Y[l]_1, X[k])$ is in R if and only if
$(Y[l]_2, X[k]) \in \mathrm{mop}(Y|_2, X)$. Since $Y|_2$ and X are ordered lists we have

$(Y[l]_2, X[k]) \in \mathrm{mop}(Y|_2, X) \iff X[k-1] \unlhd Y[l]_2 \lhd X[k] \unlhd Y[l+1]_2$,

for $k \ge 2$, and

$(Y[l]_2, X[1]) \in \mathrm{mop}(Y|_2, X) \iff Y[l]_2 \lhd X[1] \unlhd Y[l+1]_2$,

when k = 1.

It follows immediately from the implementation of the procedure, that if $Y[j]_2 \lhd X[i]$, $X[i-1] \unlhd Y[j]_2$,
and $X[i] \unlhd Y[j+1]_2$ then $(Y[j]_1, X[i]) \in R$.

We will now show that $(Y[l]_1, X[k]) \in R \Rightarrow (Y[l]_2, X[k]) \in \mathrm{mop}(Y|_2, X)$. That $(Y[l]_1, X[k]) \in R \Rightarrow
X[k-1] \unlhd Y[l]_2 \lhd X[k]$ follows immediately from the implementation of the procedure by induction on l.
It remains to show that $(Y[l]_1, X[k]) \in R \Rightarrow X[k] \unlhd Y[l+1]_2$. Assume for the sake of contradiction that
$Y[l+1]_2 \lhd X[k]$. Consider the iteration in the execution of the procedure when we look at $Y[l+1]_2$. We
have x = X[k] and thus set $y := Y[l+1]_1$ contradicting $(Y[l]_1, X[k]) \in R$. □

To show that Fl is correct we need the following proposition. 

Proposition 1 Let X be an ordered list and let x be an ancestor of X[i] for some $i \in \{1, \ldots, k\}$. If x is an
ancestor of some node in X other than X[i] then x is an ancestor of X[i - 1] or X[i + 1].

Proof. Assume for the sake of contradiction that $x \not\preceq X[i-1]$, $x \not\preceq X[i+1]$, and $x \prec z$, where $z \in X$ and
$z \ne X[i]$. Since X is ordered either $z \unlhd X[i-1]$ or $X[i+1] \unlhd z$. Assume $z \unlhd X[i-1]$. Since $x \prec X[i]$,
$x \not\preceq X[i-1]$, and X[i - 1] is to the left of X[i], X[i - 1] is to the left of x. Since $z \unlhd X[i-1]$ and $X[i-1] \lhd x$
we have $z \lhd x$ contradicting $x \prec z$. Assume $X[i+1] \unlhd z$. Since $x \prec X[i]$, $x \not\preceq X[i+1]$, and X[i + 1] is to
the right of X[i], X[i + 1] is to the right of x. Thus $x \lhd z$ contradicting $x \prec z$. □

Proposition 1 shows that the doubly linked list implementation of Deep* is correct. Clearly, Insert is
implemented correctly by the doubly linked list representation, since the nodes in the list remain in the same
order throughout the execution of the procedure.

Lemma 17 Procedure Fl(X, $\alpha$) is correct.

Proof. Let $F = \{\mathrm{fl}(x, \alpha) \mid x \in X\}$. It follows immediately from the implementation of the procedure that
$\textsc{Fl}(X, \alpha) \subseteq F$. It remains to show that $\textsc{Deep}(F) \subseteq \textsc{Fl}(X, \alpha)$. Let x be a node in Deep(F), let $z \in X$ be
the node such that $x = \mathrm{fl}(z, \alpha)$, and let $z = x_1, x_2, \ldots, x_k = x$ be the nodes on the path from z to x. In each
iteration of the algorithm we have $x_i \in Z$ for some i unless $x \in R$. □



3.4.4 Complexity of the Set Procedures 

For the running time of the node list implementation observe that, given the data structure described in 
Section 3.4.1, all set procedures, except Fl, perform a single pass over the input using constant time at each 
step. Hence we have, 

Lemma 18 For any tree T there is a data structure using $O(n_T)$ space and preprocessing which supports
each of the procedures Parent, Deep, Mop, and Nca in linear time (in the size of their input).

The running time of a single call to Fl might take time $O(n_T)$. Instead we will divide the calls to Fl into
groups and analyze the total time used on such a group of calls. The intuition behind the division is that
for a path in P the calls made to Fl by Emb are done bottom up on disjoint lists of nodes in T.

Lemma 19 For disjoint ordered node lists $X_1, \ldots, X_k$ and labels $\alpha_1, \ldots, \alpha_k$, such that any node in $X_{i+1}$
is an ancestor of some node in $\textsc{Deep}(\textsc{Fl}_T(X_i, \alpha_i))$, $1 \le i < k$, all of $\textsc{Fl}_T(X_1, \alpha_1), \ldots, \textsc{Fl}_T(X_k, \alpha_k)$ can be
computed in $O(n_T)$ time.

Proof. Let Z, R, and S be as in the implementation of the procedure. Since Deep and Deep* take time
$O(|S|)$, we only need to show that the total length of the lists S, summed over all the calls, is $O(n_T)$ to
analyze the total time usage of Deep and Deep*. We note that in one iteration $|S| \le |Z|$. Insert takes
constant time and it is thus enough to show that any node in T can be in Z at most twice during all calls
to Fl.

Consider a call to Fl. Note that Z is ordered at all times. Except for the first iteration, a node can be 
in Z only if one of its children were in Z in the last iteration. Thus in one call to Fl a node can be in Z 
only once. 

Look at a node z the first time it appears in Z. Assume that this is in the call $\textsc{Fl}(X_i, \alpha_i)$. If $z \in X_i$ then
z cannot be in Z in any later calls, since no node in $X_j$ where j > i can be a descendant of a node in $X_i$.
If $z \in R$ in this call then z cannot be in Z in any later calls. To see this look at the time when z is removed
from Z. Since the set $Z \cup R$ is deep at all times no descendant of z will appear in Z later in this call to Fl,
and no node in R can be a descendant of z. Since any node in $X_j$, j > i, is an ancestor of some node in
$\textsc{Deep}(\textsc{Fl}(X_i, \alpha_i))$ neither z nor any descendant of z can be in any $X_j$, j > i. Thus z cannot appear in Z in
any later calls to Fl. Now if $z \notin R$ then we might have $z \in X_{i+1}$. In that case, z will appear in Z in the
first iteration of the procedure call $\textsc{Fl}(X_{i+1}, \alpha_{i+1})$, but not in any later calls since the lists are disjoint, and
since no node in $X_j$ where j > i + 1 can be a descendant of a node in $X_{i+1}$. If $z \notin R$ and $z \notin X_{i+1}$ then
clearly z cannot appear in Z in any later call. Thus a node in T is in Z at most twice during all the calls. □



3.4.5 Complexity of the Tree Inclusion Algorithm 

Using the node list implementation of the set procedures we get: 

Theorem 6 For trees P and T the tree inclusion problem can be solved in 0(1 put) time and 0(np) space. 

Proof. By Lemma 18 we can preprocess T in O(n_T) time and space. Let g(n) denote the time used by Fl 
on a list of length n. Consider the time used by Emb(root(P)). We bound the contribution for each node 
v ∈ V(P). From Lemma 18 it follows that if v is a leaf the cost of v is at most O(g(l_T)). Hence, by Lemma 19, 
the total cost of all leaves is O(l_P g(l_T)) = O(l_P n_T). If v has a single child w the cost is O(g(|Emb(w)|)). 
If v has more than one child the cost of Mop, Nca, and Deep is bounded by Σ_{w ∈ child(v)} O(|Emb(w)|). 
Furthermore, since the length of the output of Mop (and thus Nca) is at most z = min_{w ∈ child(v)} |Emb(w)| 
the cost of Fl is O(g(z)). Hence, the total cost for internal nodes is, 

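\[
\sum_{v \in V(P) \setminus L(P)} O\Big( g\big(\min_{w \in \mathrm{child}(v)} |\textsc{Emb}(w)|\big) + \sum_{w \in \mathrm{child}(v)} |\textsc{Emb}(w)| \Big) \le \sum_{v \in V(P)} O\big(g(|\textsc{Emb}(v)|)\big). \tag{3.1}
\]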

Next we bound (3.1). For any w ∈ child(v) we have that Emb(w) and Emb(v) are disjoint ordered lists. 
Furthermore, any node in Emb(v) must be an ancestor of a node in Deep(Fl(Emb(w), label(v))). 
Hence, by Lemma 19, for any leaf-to-root path δ = v_1, ..., v_k in P, we have that Σ_{u ∈ δ} g(|Emb(u)|) ≤ O(n_T). 
Let Δ denote the set of all root-to-leaf paths in P. It follows that, 

\[
\sum_{v \in V(P)} g(|\textsc{Emb}(v)|) \le \sum_{p \in \Delta} \sum_{u \in p} g(|\textsc{Emb}(u)|) \le O(l_P n_T).
\]

Since this time dominates the time spent at the leaves the time bound follows. Next consider the space 
used by Emb(root(P)). The preprocessing of Section 3.4.1 uses only O(n_T) space. Furthermore, by 
induction on the size of the subtree P(v) it follows immediately that at each step in the algorithm at most 
O(max_{v ∈ V(P)} |Emb(v)|) space is needed. Since Emb(v) is a deep embedding, it follows that |Emb(v)| ≤ l_T. □ 

3.4.6 An Alternative Algorithm 

In this section we present an alternative algorithm. Since the time complexity of the algorithm in the previous 
section is dominated by the time used by Fl, we present an implementation of this procedure which leads to 
a different complexity. Define a firstlabel data structure as a data structure supporting queries of the form 
fl(v, α), v ∈ V(T), α ∈ Σ. Maintaining such a data structure is known as the tree color problem. This is a 
well-studied problem, see e.g. [Die89, MM96, FM96, AHR98]. With such a data structure available we can 
compute Fl as follows, 

Fl(X, α): Return the list R := [fl(X[1], α), ..., fl(X[|X|], α)]. 
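
To make the role of the firstlabel query concrete, the following is a minimal Python sketch of Fl built on top of a query fl(v, α). The query here is a naive stand-in that walks parent pointers, not one of the cited tree color data structures. The names `fl` and `fl_list`, the dictionary representation of the tree, and the convention that a node counts as its own ancestor are assumptions made for illustration.

```python
def fl(parent, label, v, alpha):
    """Naive firstlabel query: nearest ancestor of v (v itself included,
    by assumption) whose label is alpha, or None if there is none."""
    while v is not None and label[v] != alpha:
        v = parent[v]
    return v


def fl_list(parent, label, X, alpha):
    """Fl(X, alpha): apply the firstlabel query to every entry of the list X."""
    return [fl(parent, label, x, alpha) for x in X]


# Hypothetical example tree: 0 is the root, 1 and 3 are its children,
# 2 is the child of 1.
parent = {0: None, 1: 0, 2: 1, 3: 0}
label = {0: "a", 1: "b", 2: "a", 3: "c"}
result = fl_list(parent, label, [1, 2, 3], "a")
```

Replacing the body of `fl` by a query to a real firstlabel structure leaves `fl_list` unchanged, which is the point of Theorem 7: the query time q(n_T) is the only quantity that varies.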

Theorem 7 Let P and T be trees. Given a firstlabel data structure using s(n_T) space, p(n_T) preprocessing 
time, and q(n_T) time for queries, the tree inclusion problem can be solved in O(p(n_T) + n_P l_T · q(n_T)) time 
and O(s(n_T) + n_T) space. 

Proof. Constructing the firstlabel data structure uses O(s(n_T)) space and O(p(n_T)) time. As in the proof of 
Theorem 6 we have that the total time used by Emb(root(P)) is bounded by Σ_{v ∈ V(P)} g(|Emb(v)|), where 
g(n) is the time used by Fl on a list of length n. Since Emb(v) is a deep embedding and each fl query takes q(n_T) 
time we have, 

\[
\sum_{v \in V(P)} g(|\textsc{Emb}(v)|) \le \sum_{v \in V(P)} g(l_T) = n_P l_T \cdot q(n_T).
\]

□ 

Several firstlabel data structures are available, for instance, if we want to maintain linear space we have, 

Lemma 20 (Dietz [Die89]) For any tree T there is a data structure using O(n_T) space and O(n_T) expected 
preprocessing time which supports firstlabel queries in O(log log n_T) time. 

The expectation in the preprocessing time is due to perfect hashing. Since our data structure does not 
need to support efficient updates we can remove the expectation by using the deterministic dictionary of 
Hagerup et al. [HMP01]. This gives a worst-case preprocessing time of O(n_T log n_T); however, using a 
simple two-level approach this can be reduced to O(n_T) (see e.g. [Tho03]). Plugging in this data structure 
we obtain, 

Corollary 1 For trees P and T the tree inclusion problem can be solved in O(n_P l_T log log n_T + n_T) time 
and O(n_T) space. 

3.5 A Faster Tree Inclusion Algorithm 

In this section we present a new tree inclusion algorithm which has a worst-case subquadratic running time. 
As discussed in the introduction the general idea is to divide T into clusters of logarithmic size which we 
can efficiently preprocess and then use this to speed up the computation by a logarithmic factor. 

3.5.1 Clustering 

In this section we describe how to divide T into clusters and how the macro tree is created. For simplicity in 
the presentation we assume that T is a binary tree. If this is not the case it is straightforward to construct 
a binary tree B, where n_B ≤ 2n_T, and a mapping g : V(T) → V(B) such that for any pair of nodes 
v, w ∈ V(T), label(v) = label(g(v)), v ≺ w iff g(v) ≺ g(w), and v ⊲ w iff g(v) ⊲ g(w). If the nodes in the 
set U = V(B) \ {g(v) | v ∈ V(T)} are assigned a special label β ∉ Σ it follows that for any tree P, P ⊑ T iff 
P ⊑ B. 

Let C be a connected subgraph of T. A node in V(C) incident to a node in V(T) \ V(C) is a boundary 
node. The boundary nodes of C are denoted by δC. A cluster of C is a connected subgraph of C with 
at most two boundary nodes. A set of clusters CS is a cluster partition of T iff V(T) = ∪_{C ∈ CS} V(C), 
E(T) = ∪_{C ∈ CS} E(C), and for any C_1, C_2 ∈ CS, E(C_1) ∩ E(C_2) = ∅, |E(C_1)| ≥ 1, and root(T) ∈ δC if 
root(T) ∈ V(C). If |δC| = 1 we call C a leaf cluster and otherwise an internal cluster. 

We use the following recursive procedure Cluster_T(v, s), adapted from [AR02], which creates a cluster 
partition CS of the tree T(v) with the property that |CS| = O(s) and |V(C)| ≤ ⌈n_T/s⌉. A similar cluster 
partitioning achieving the same result follows from [AHT00, AHdLT97, Fre97]. 

Cluster_T(v, s): For each child u of v there are two cases: 

1. |V(T(u))| + 1 ≤ ⌈n_T/s⌉. Let the nodes {v} ∪ V(T(u)) be a leaf cluster with boundary 
node v. 

2. |V(T(u))| > ⌈n_T/s⌉. Pick a node w ∈ V(T(u)) of maximum depth such that 
|V(T(u))| + 2 − |V(T(w))| ≤ ⌈n_T/s⌉. Let the nodes V(T(u)) \ V(T(w)) ∪ {v, w} be an 
internal cluster with boundary nodes v and w. Recursively, compute Cluster_T(w, s). 
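
The two cases of the procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the linear-time implementation referred to in Lemma 21: the tree is given as a hypothetical dictionary `children` mapping a node to the list of its children, node sets are materialized explicitly (which costs more than O(n_T) time in general), and every child that does not satisfy case 1 is treated as case 2.

```python
import math


def subtree_sizes(children, root):
    """Number of nodes in T(v) for every node v."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    size = {}
    for v in reversed(order):  # children are handled before their parents
        size[v] = 1 + sum(size[u] for u in children.get(v, []))
    return size


def subtree_nodes(children, v):
    """The node set V(T(v))."""
    out, stack = set(), [v]
    while stack:
        x = stack.pop()
        out.add(x)
        stack.extend(children.get(x, []))
    return out


def cluster_partition(children, root, s):
    """Return a list of (node set, boundary node set) pairs."""
    size = subtree_sizes(children, root)
    c = math.ceil(size[root] / s)
    if c < 2:
        raise ValueError("need ceil(n_T / s) >= 2")
    clusters, todo = [], [root]
    while todo:
        v = todo.pop()
        for u in children.get(v, []):
            if size[u] + 1 <= c:
                # Case 1: leaf cluster with boundary node v.
                clusters.append((subtree_nodes(children, u) | {v}, {v}))
                continue
            # Case 2: deepest w in T(u) with |V(T(u))| + 2 - |V(T(w))| <= c.
            # The condition fails for all descendants once it fails for a node.
            w, w_depth, stack = u, 0, [(u, 0)]
            while stack:
                x, d = stack.pop()
                if size[u] + 2 - size[x] > c:
                    continue
                if d > w_depth:
                    w, w_depth = x, d
                stack.extend((y, d + 1) for y in children.get(x, []))
            nodes = (subtree_nodes(children, u) - subtree_nodes(children, w)) | {v, w}
            clusters.append((nodes, {v, w}))
            todo.append(w)  # recurse on the bottom boundary node
    return clusters


# Hypothetical binary tree with 9 nodes; with s = 2 the size bound is 5.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
cs = cluster_partition(children, 0, 2)
```

On this example the child 1 of the root falls into case 2 with w = 3, giving the internal cluster {0, 1, 3, 4} with boundary nodes 0 and 3, while the child 2 gives the leaf cluster {0, 2, 5, 6}.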

Lemma 21 Given a tree T with n_T > 1 nodes, and a parameter s, where ⌈n_T/s⌉ ≥ 2, we can build a cluster 
partition CS in O(n_T) time, such that |CS| = O(s) and |V(C)| ≤ ⌈n_T/s⌉ for any C ∈ CS. 

Proof. The procedure Cluster_T(root(T), s) clearly creates a cluster partition of T and it is straightforward 
to implement in O(n_T) time. Consider the size of the clusters created. There are two cases for u. In case 
1, |V(T(u))| + 1 ≤ ⌈n_T/s⌉ and hence the cluster C = {v} ∪ V(T(u)) has size |V(C)| ≤ ⌈n_T/s⌉. In case 
2, |V(T(u))| + 2 − |V(T(w))| ≤ ⌈n_T/s⌉ and hence the cluster C = V(T(u)) \ V(T(w)) ∪ {v, w} has size 
|V(C)| ≤ ⌈n_T/s⌉. 

Next consider the size of the cluster partition. Let c = ⌈n_T/s⌉. We say that a cluster C is bad if 
|V(C)| ≤ c/2 and good otherwise. We will show that at least a constant fraction of the clusters in the cluster 
partition are good. It is easy to verify that the cluster partition created by procedure Cluster has the 
following properties: 

(i) Let C be a bad internal cluster with boundary nodes v and w (v ≺ w). Then w has two children with 
at least c/2 descendants each. 

(ii) Let C be a bad leaf cluster with boundary node v. Then the boundary node v is contained in a good 
cluster. 

By (ii) the number of bad leaf clusters is no larger than twice the number of good internal clusters. By (i) 
each bad internal cluster C shares its lowest boundary node with two other clusters, and each of 
these two clusters is either an internal cluster or a good leaf cluster. This together with (ii) shows that the number 
of bad clusters is at most a constant fraction of the total number of clusters. Since a good cluster is of size 
more than c/2, there can be at most 2s good clusters and thus |CS| = O(s). □ 

Let C ∈ CS be an internal cluster with v, w ∈ δC. The spine path of C is the path between v and w excluding 
v and w. A node on the spine path is a spine node. A node to the left and right of v, w, or any node on 
the spine path is a left node and right node, respectively. If C is a leaf cluster with v ∈ δC then any proper 
descendant of v is a leaf node. 

Let CS be a cluster partition of T as described in Lemma 21. We define an ordered macro tree M. Our 
definition of M may be viewed as an "ordered" version of the macro tree defined in [AR02]. The node set 
V(M) consists of the boundary nodes in CS. Additionally, for each internal cluster C ∈ CS, v, w ∈ δC, 
v ≺ w, we have the nodes s(v, w), l(v, w) and r(v, w) and edges (v, s(v, w)), (s(v, w), w), (l(v, w), s(v, w)), and 
(r(v, w), s(v, w)). The nodes are ordered such that l(v, w) ⊲ w ⊲ r(v, w). For each leaf cluster C, v ∈ δC, we 
have the node l(v) and edge (l(v), v). Since root(T) is a boundary node M is rooted at root(T). Figure 3.5 
illustrates these definitions. 

Figure 3.5: The clustering and the macro tree. (a) An internal cluster. The black nodes are the boundary 
nodes and the internal ellipses correspond to the boundary nodes, the right and left nodes, and the spine path. 
(b) The macro tree corresponding to the cluster in (a). (c) A leaf cluster. The internal ellipses are the 
boundary node and the leaf nodes. (d) The macro tree corresponding to the cluster in (c). 



To each node u ∈ V(T) we associate a unique macro node denoted c(u). Let u ∈ V(C), where C ∈ CS. Then 

\[
c(u) =
\begin{cases}
u & \text{if $u$ is a boundary node,} \\
l(v) & \text{if $u$ is a leaf node and $v \in \delta C$,} \\
s(v,w) & \text{if $u$ is a spine node, $v, w \in \delta C$, and $v \prec w$,} \\
l(v,w) & \text{if $u$ is a left node, $v, w \in \delta C$, and $v \prec w$,} \\
r(v,w) & \text{if $u$ is a right node, $v, w \in \delta C$, and $v \prec w$.}
\end{cases}
\]

Conversely, for any macro node i ∈ V(M) define the micro forest, denoted C(i), as the induced subgraph 
of T of the set of nodes {v | v ∈ V(T), i = c(v)}. We also assign a set of labels to i given by label(i) = 
{label(v) | v ∈ V(C(i))}. If i is a spine node or a boundary node the unique node in V(C(i)) of greatest 
depth is denoted by first(i). Finally, for any set of nodes {i_1, ..., i_k} ⊆ V(M) we define C(i_1, ..., i_k) as the 
induced subgraph of the set of nodes V(C(i_1)) ∪ ··· ∪ V(C(i_k)). 

The following propositions state useful properties of ancestors, nearest common ancestors, and the 
left-to-right ordering in the micro forests and in T. The propositions follow directly from the definition of the 
clustering. See also Figure 3.6. 

Proposition 2 (Ancestor relations) For any pair of nodes v, w ∈ V(T), the following hold: 

(i) If c(v) = c(w) then v ≺_T w iff v ≺_{C(c(v))} w. 

(ii) If c(v) ≠ c(w), c(v) ∈ {s(v', w'), v'}, and c(w) ∈ {l(v', w'), r(v', w')} then v ≺_T w iff 
v ≺_{C(c(w), s(v', w'), v')} w. 

(iii) In all other cases, w ≺_T v iff c(w) ≺_M c(v). 

Figure 3.6: Examples from the propositions. In all cases v' and w' are top and bottom boundary nodes of 
the cluster, respectively. (a) Proposition 2(ii). Here c(v) = s(v', w') and c(w) = l(v', w') (solid ellipses). The 
dashed ellipse corresponds to C(c(w), s(v', w'), v'). (b) Proposition 3(i) and 4(ii). Here c(v) = c(w) = l(v', w') 
(solid ellipse). The dashed ellipse corresponds to C(c(v), s(v', w'), v'). (c) Proposition 3(ii) and 4(i). Here 
c(v) = c(w) = l(v') (solid ellipse). The dashed ellipse corresponds to C(c(v), v'). (d) Proposition 3(iii). Here 
c(v) = l(v', w') and c(w) = s(v', w') (solid ellipses). The dashed ellipse corresponds to C(c(v), c(w), v'). (e) 
Proposition 3(iv). Here c(v) = s(v', w') and c(w) = r(v', w') (solid ellipses). The dashed ellipse corresponds 
to C(c(v), c(w), v'). (f) Proposition 4(iv). Here c(v) = r(v', w') and c(w) = l(v', w') (solid ellipses). The 
dashed ellipse corresponds to C(c(v), c(w), s(v', w'), v'). (g) Proposition 4(v). Here c(v) = r(v', w') (solid 
ellipse) and w' ⪯_M c(w). The dashed ellipse corresponds to C(c(v), s(v', w'), v', w'). 



Case (i) says that if v and w belong to the same macro node then v is an ancestor of w iff v is an ancestor 
of w in the micro forest for that macro node. Case (ii) says that if v is a spine node or a top boundary node 
and w is a left or right node in the same cluster then v is an ancestor of w iff v is an ancestor of w in the 
micro tree induced by that cluster (Figure 3.6(a)). Case (iii) says that in all other cases v is an ancestor of 
w iff the macro node v belongs to is an ancestor of the macro node w belongs to in the macro tree. 

Proposition 3 (Left-of relations) For any pair of nodes v, w ∈ V(T), the following hold: 

(i) If c(v) = c(w) ∈ {r(v', w'), l(v', w')} then v ⊲ w iff v ⊲_{C(c(v), v', s(v', w'))} w. 

(ii) If c(v) = c(w) = l(v') then v ⊲ w iff v ⊲_{C(c(v), v')} w. 

(iii) If c(v) = l(v', w') and c(w) = s(v', w') then v ⊲ w iff v ⊲_{C(c(v), c(w), v')} w. 

(iv) If c(v) = s(v', w') and c(w) = r(v', w') then v ⊲ w iff v ⊲_{C(c(v), c(w), v')} w. 

(v) In all other cases, v ⊲ w iff c(v) ⊲_M c(w). 



Case (i) says that if v and w are both either left or right nodes in the same cluster then v is to the left of w iff 
v is to the left of w in the micro tree induced by their macro node together with the spine and top boundary 
node of the cluster (Figure 3.6(b)). Case (ii) says that if v and w are both leaf nodes in the same cluster 
then v is to the left of w iff v is to the left of w in the micro tree induced by that leaf cluster (Figure 3.6(c)). 
Case (iii) says that if v is a left node and w is a spine node in the same cluster then v is to the left of w iff 
v is to the left of w in the micro tree induced by their two macro nodes and the top boundary node of the 
cluster (Figure 3.6(d)). Case (iv) says that if v is a spine node and w is a right node in the same cluster 
then v is to the left of w iff v is to the left of w in the micro tree induced by their two macro nodes and the 
top boundary node of the cluster (Figure 3.6(e)). In all other cases v is to the left of w iff the macro node v 
belongs to is to the left of the macro node of w in the macro tree (Case (v)). 

Proposition 4 (Nca relations) For any pair of nodes v, w ∈ V(T), the following hold: 

(i) If c(v) = c(w) = l(v') then nca_T(v, w) = nca_{C(c(v), v')}(v, w). 

(ii) If c(v) = c(w) ∈ {l(v', w'), r(v', w')} then nca_T(v, w) = nca_{C(c(v), s(v', w'), v')}(v, w). 

(iii) If c(v) = c(w) = s(v', w') then nca_T(v, w) = nca_{C(c(v))}(v, w). 

(iv) If c(v) ≠ c(w) and c(v), c(w) ∈ {l(v', w'), r(v', w'), s(v', w')} then 
nca_T(v, w) = nca_{C(c(v), c(w), s(v', w'), v')}(v, w). 

(v) If c(v) ≠ c(w), c(v) ∈ {l(v', w'), r(v', w'), s(v', w')}, and w' ⪯_M c(w) then 
nca_T(v, w) = nca_{C(c(v), s(v', w'), v', w')}(v, w'). 

(vi) If c(v) ≠ c(w), c(w) ∈ {l(v', w'), r(v', w'), s(v', w')}, and w' ⪯_M c(v) then 
nca_T(v, w) = nca_{C(c(w), s(v', w'), v', w')}(w, w'). 

(vii) In all other cases, nca_T(v, w) = nca_M(c(v), c(w)). 

Case (i) says that if v and w are leaf nodes in the same cluster then the nearest common ancestor of v and 
w is the nearest common ancestor of v and w in the micro tree induced by that leaf cluster (Figure 3.6(c)). 
Case (ii) says that if v and w are both either left nodes or right nodes then the nearest common ancestor of 
v and w is the nearest common ancestor in the micro tree induced by their macro node together with the 
spine and top boundary node of the cluster (Figure 3.6(b)). Case (iii) says that if v and w are both spine 
nodes in the same cluster then the nearest common ancestor of v and w is the nearest common ancestor 
of v and w in the micro tree induced by their macro node. Case (iv) says that if v and w are in different 
macro nodes but are right, left, or spine nodes in the same cluster then the nearest common ancestor of v 
and w is the nearest common ancestor of v and w in the micro tree induced by that cluster (we can omit 
the bottom boundary node) (Figure 3.6(f)). Case (v) says that if v is a left, right, or spine node, and the 
bottom boundary node w' of v's cluster is an ancestor in the macro tree of the macro node containing w, 
then the nearest common ancestor of v and w is the nearest common ancestor of v and w' in the micro tree 
induced by the macro node of v, the spine node, and the top and bottom boundary nodes of v's cluster 
(Figure 3.6(g)). Case (vi) is the same as case (v) with v and w interchanged. In all other cases the nearest 
common ancestor of v and w is the nearest common ancestor of their macro nodes in the macro tree (Case 
(vii)). 

3.5.2 Preprocessing 

In this section we describe how to preprocess T. First build a cluster partition CS of the tree T with 
clusters of size s, to be fixed later, and the corresponding macro tree M in O(n_T) time. The macro tree 
is preprocessed as in Section 3.4.1. However, since nodes in M contain a set of labels, we now store a 
dictionary for label(v) for each node v ∈ V(M). Using the deterministic dictionary 
of Hagerup et al. [HMP01] all these dictionaries can be constructed in O(n_T log n_T) time and O(n_T) space. 
Furthermore, we extend the definition of fl such that fl_M(v, α) is the nearest ancestor w of v such that 
α ∈ label(w). 

Next we show how to preprocess the micro forests. For any cluster C ∈ CS, deep sets X, Y, Z ⊆ V(C), 
i ∈ ℕ, and α ∈ Σ define the following procedures on cluster C. 

Left_C(i, X): Return the leftmost i nodes in X. 

Right_C(i, X): Return the rightmost i nodes in X. 

Leftof_C(X, Y): Return all nodes of X to the left of the leftmost node in Y. 

Match_C(X, Y, Z), where X = {m_1 ⊲ ··· ⊲ m_k}, Y = {v_1 ⊲ ··· ⊲ v_k}, and Z ⊆ Y: Return R := {m_j | v_j ∈ Z}. 

Mop_C(X, Y): Return the pair (R_1, R_2), where R_1 = mop(X, Y)|_1 and R_2 = mop(X, Y)|_2. 

In addition to these procedures we also define the set procedures on clusters, that is, Parent_C, Nca_C, 
Deep_C, and Fl_C, as in Section 3.3. Collectively, we will call these the cluster procedures. We represent the 
input and output sets in the procedures as bit strings indexed by preorder numbers. Specifically, a subset 
X in a cluster C is given by a bit string b_1 ... b_s, such that b_i = 1 iff the ith node in a preorder traversal of 
C is in X. If C contains fewer than s nodes we leave the remaining values undefined. 

The procedure Left_C(i, X) then corresponds to setting all bits in X larger than the ith set bit to zero. 
Similarly, Right_C(i, X) corresponds to setting all bits smaller than the ith largest set bit to zero. Likewise, 
the procedures Leftof_C(X, Y), Rightof_C(X, Y), and Match_C(X, Y, Z) only depend on the preorder of 
the nodes and thus only on the bit string, not on any other information about the cluster. We can thus omit 
the subscript C from these five procedures. 
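
The bit-string view of Left and Right can be illustrated with Python integers. In this sketch, which is an assumption made for illustration rather than the tabulated implementation described in the text, bit k of an integer stands for the node with preorder number k + 1, so a smaller bit position means an earlier node in preorder.

```python
def left(i, x):
    """Keep the i set bits of x with the smallest preorder numbers."""
    out = 0
    for _ in range(i):
        if x == 0:
            break
        low = x & -x          # isolate the lowest set bit
        out |= low
        x ^= low              # remove it and continue with the next one
    return out


def right(i, x):
    """Keep the i set bits of x with the largest preorder numbers."""
    out = 0
    for _ in range(i):
        if x == 0:
            break
        high = 1 << (x.bit_length() - 1)  # isolate the highest set bit
        out |= high
        x ^= high
    return out


# The set {1, 3, 4, 6} in preorder numbers corresponds to bits 0, 2, 3, 5.
x = 0b101101
two_leftmost = left(2, x)    # bits 0 and 2
two_rightmost = right(2, x)  # bits 5 and 3
```

Both functions look only at the bit pattern, never at the shape of the cluster, which is why the subscript C is unnecessary for these procedures.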

Next we show how to implement the cluster procedures efficiently. We precompute the value of all 
procedures, except Fl_C, for all possible inputs and clusters. By definition, these procedures do not depend 
on any specific labeling of the nodes in the cluster. Hence, it suffices to precompute the value for all rooted, 
ordered trees with at most s nodes. The total number of these is less than 2^{2s} (consider e.g. an encoding 
using balanced parentheses). Furthermore, the number of possible input sets is at most 2^s. Since at most 
3 sets are given as input to a cluster procedure, it follows that we can tabulate all solutions using less 
than 2^{3s} · 2^{2s} = 2^{5s} bits of memory. Hence, choosing s ≤ (1/10) log n we use O(2^{(1/2) log n}) = O(√n) bits. 
Using standard bitwise operations each solution is easily implemented in O(s) time giving a total time of 
O(√n log n). 
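
For reference, the table-size bound follows by substituting the choice of s. The short calculation below assumes logarithms to base 2:

\[
2^{3s} \cdot 2^{2s} = 2^{5s} \le 2^{5 \cdot \frac{1}{10} \log n} = 2^{\frac{1}{2} \log n} = \sqrt{n},
\qquad
\sqrt{n} \cdot O(s) = O(\sqrt{n} \log n).
\]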

Since the procedure Fl_C depends on the alphabet, which may be of size n_T, we cannot efficiently apply 
the same trick as above. Instead define for any cluster C ∈ CS, X ⊆ V(C), and α ∈ Σ: 

Ancestor_C(X): Return the set {x | x is an ancestor of a node in X}. 

Eq_C(α): Return the set {x | x ∈ V(C), label(x) = α}. 

Clearly, Ancestor_C can be implemented as above. For Eq_C note that the total number of distinct labels 
in C is at most s. Hence, Eq_C can be stored in a dictionary with at most s entries each of which is a bit 
string of length s. Thus, (using again the result of [HMP01]) the total time to build all such dictionaries is 
O(n_T log n_T). 

By the definition of Fl_C we have that, 

Fl_C(X, α) = Deep_C(Ancestor_C(X) ∩ Eq_C(α)). 

Since intersection can be implemented using a binary and-operation, Fl_C(X, α) can be computed in constant 
time. Later, we will also need to compute the union of bit strings and we note that this can be done using a 
binary or-operation. 
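
The identity for Fl_C can be checked on a small cluster with a Python sketch in which sets are integers used as bit masks. Here Ancestor_C and Deep_C are computed directly by walking parent pointers rather than looked up in a precomputed table, nodes are numbered 0, 1, ... in preorder, and a node is counted as its own ancestor; all of these choices, and the function names, are assumptions made for illustration.

```python
def ancestors_mask(parent, x):
    """Ancestor_C(X): all ancestors (nodes themselves included) of nodes in x."""
    out = 0
    for v in range(len(parent)):
        if (x >> v) & 1:
            u = v
            while u is not None and not (out >> u) & 1:
                out |= 1 << u
                u = parent[u]
    return out


def deep_mask(parent, y):
    """Deep_C(Y): nodes of y that have no proper descendant in y."""
    has_descendant = 0
    for v in range(len(parent)):
        if (y >> v) & 1:
            u = parent[v]
            while u is not None:
                has_descendant |= 1 << u
                u = parent[u]
    return y & ~has_descendant


def eq_masks(labels):
    """Dictionary mapping each label alpha to the bit mask Eq_C(alpha)."""
    table = {}
    for v, a in enumerate(labels):
        table[a] = table.get(a, 0) | (1 << v)
    return table


def fl_cluster(parent, eq, x, alpha):
    """Fl_C(X, alpha) = Deep_C(Ancestor_C(X) AND Eq_C(alpha))."""
    return deep_mask(parent, ancestors_mask(parent, x) & eq.get(alpha, 0))


# Hypothetical cluster: 0 is the root with children 1 and 4; 1 has children 2, 3.
parent = [None, 0, 1, 1, 0]
labels = ["a", "b", "a", "c", "a"]
eq = eq_masks(labels)
r1 = fl_cluster(parent, eq, 0b01100, "a")  # X = {2, 3}
r2 = fl_cluster(parent, eq, 0b11000, "a")  # X = {3, 4}
```

For X = {2, 3} the nearest ancestors labeled a are 2 and 0, and Deep keeps only node 2; for X = {3, 4} they are 0 and 4, and Deep keeps only node 4.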

To implement the set procedures in the following section we often need to "restrict" the cluster procedures 
to work on a subtree of a cluster. Specifically, for any set of macro nodes {i_1, ..., i_k} in the same cluster C 
(hence, k ≤ 5), we will replace the subscript C with C(i_1, ..., i_k). For instance, Parent_{C(s(v,w), l(v,w))}(X) = 
{parent(x) | x ∈ X ∩ V(C(s(v, w), l(v, w)))} ∩ V(C(s(v, w), l(v, w))). To implement all restricted versions of 
the cluster procedures, we compute for each cluster C ∈ CS a bit string representing the set of nodes in each 
micro forest. Clearly, this can be done in O(n_T) time. Since there are at most 5 micro forests in each cluster 
it follows that we can compute any restricted version using an additional constant number of and-operations. 

Note that the total preprocessing time and space is dominated by the construction of deterministic 
dictionaries which use O(n_T log n_T) time and O(n_T) space. 

3.5.3 Implementation of the Set Procedures 

Using the preprocessing from the previous section we show how to implement the set procedures in sublinear 
time. First we define a compact representation of node sets. Let T be a tree with macro tree M. For 
simplicity, we identify nodes in M with their preorder number. Let S ⊆ V(T) be any subset of nodes of T. 
A micro-macro node array (abbreviated node array) X representing S is an array of size n_M. The ith entry, 
denoted X[i], represents the subset of nodes in C(i), that is, X[i] = V(C(i)) ∩ S. The set X[i] is encoded 
using the same bit representation as in Section 3.5.2. By our choice of parameter in the clustering the space 
used for this representation is O(n_T / log n_T). 
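
A node array can be pictured as one bit mask per macro node. The Python sketch below converts between a plain node set and this representation; the lookup tables `macro_of` (the map c) and `bit_of` (the position of a node inside the bit string of its micro forest) are hypothetical inputs assumed to come from the preprocessing.

```python
def to_node_array(S, macro_of, bit_of, n_macro):
    """Build the node array X with X[i] = V(C(i)) ∩ S, one bit mask per macro node."""
    X = [0] * n_macro
    for v in S:
        X[macro_of[v]] |= 1 << bit_of[v]
    return X


def from_node_array(X, macro_of, bit_of):
    """Recover the represented node set from a node array."""
    return {v for v in macro_of if (X[macro_of[v]] >> bit_of[v]) & 1}


# Hypothetical example with three macro nodes (numbered from 0 here).
macro_of = {"a": 0, "b": 1, "c": 1, "d": 2}
bit_of = {"a": 0, "b": 0, "c": 1, "d": 0}
X = to_node_array({"a", "c"}, macro_of, bit_of, 3)
S = from_node_array(X, macro_of, bit_of)
```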

We now present the detailed implementation of the set procedures on node arrays. Let X be a node 
array. 

Parent(X): Initialize a node array R of size n_M and set i := 1. 
Repeat until i > n_M: 

    Set i := i + 1 until X[i] ≠ ∅. 

    There are three cases depending on the type of i: 

    1. i ∈ {l(v, w), r(v, w)}. Compute N := Parent_{C(i, s(v,w), v)}(X[i]). For each j ∈ 
       {i, s(v, w), v}, set R[j] := R[j] ∪ (N ∩ V(C(j))). 

    2. i = l(v). Compute N := Parent_{C(i, v)}(X[i]). For each j ∈ {i, v}, set R[j] := 
       R[j] ∪ (N ∩ V(C(j))). 

    3. i ∉ {l(v, w), r(v, w), l(v)}. Compute N := Parent_{C(i)}(X[i]). If N ≠ ∅ set R[i] := 
       R[i] ∪ N. Otherwise, if j := parent_M(i) ≠ ⊥ set R[j] := R[j] ∪ {first(j)}. 

    Set i := i + 1. 
Return R. 

To see the correctness of the implementation of procedure Parent consider the three cases of the procedure. 
Case 1 handles the fact that left or right nodes may have a node on a spine or boundary node as parent. 
Since no left or right nodes can have a parent outside their cluster there is no need to compute parents in 
the macro tree. Case 2 handles the fact that a leaf node may have the boundary node as parent. Since no 
leaf node can have a parent outside its cluster there is no need to compute parents in the macro tree. Case 
3 handles boundary and spine nodes. In this case there is either a parent within the micro forest or we can 
use the macro tree to compute the parent of the root of the micro tree. Since the input to Parent is deep 
we only need to do one of the two things. If the computation of parent in the macro tree returns a node j, 
this will either be a spine node or a boundary node. To take care of the case where j is a spine node, we 
add the lowest node (first(j)) in j to the output. Procedure Parent thus correctly computes parent for all 
kinds of macro nodes. 

We now give the implementation of procedure Nca. The input to procedure Nca is two node arrays X 
and Y representing two subsets 𝒳, 𝒴 ⊆ V(T), |𝒳| = |𝒴| = k. The output is a node array R representing the 
set {nca(𝒳_i, 𝒴_i) | 1 ≤ i ≤ k}, where 𝒳_i and 𝒴_i are the ith elements of 𝒳 and 𝒴 with respect to their preorder number 
in the tree, respectively. We also assume that we have 𝒳_i ≺ 𝒴_i for all i (since Nca is always called on a set 
of minimum ordered pairs). 

Nca(X, Y): Initialize a node array R of size n_M, set i := 1 and j := 1. 
Repeat until i > n_M or j > n_M: 

    Until X[i] ≠ ∅ set i := i + 1. Until Y[j] ≠ ∅ set j := j + 1. 
    Compare i and j. There are two cases: 

    1. i = j. There are two subcases: 

       (a) i is a boundary node. 
           Set R[i] := X[i], i := i + 1 and j := j + 1. 

       (b) i is not a boundary node. 
           Compare the sizes of X[i] and Y[j]. There are two cases: 

           - |X[i]| > |Y[j]|. Set X_i := Left(|Y[j]|, X[i]), 
           - |X[i]| = |Y[j]|. Set X_i := X[i]. 

           Set 
               S := C(i, v)            if i = l(v), 
                    C(i, s(v, w), v)   if i ∈ {l(v, w), r(v, w)}, 
                    C(i, v)            if i = s(v, w). 

           Compute N := Nca_S(X_i, Y[j]). 
           For each macro node h in S set R[h] := R[h] ∪ (N ∩ V(C(h))). 
           Set X[i] := X[i] \ X_i and j := j + 1. 

    2. i ≠ j. Compare the sizes of X[i] and Y[j]. There are three cases: 

       - |X[i]| > |Y[j]|. Set X_i := Left(|Y[j]|, X[i]) and Y_j := Y[j], 
       - |X[i]| < |Y[j]|. Set X_i := X[i] and Y_j := Left(|X[i]|, Y[j]), 
       - |X[i]| = |Y[j]|. Set X_i := X[i] and Y_j := Y[j]. 

       Compute h := Nca_M(i, j). There are two subcases: 

       (a) h is a boundary node. Set R[h] := 1. 

       (b) h is a spine node s(v, w). There are three cases: 

           i. i ∈ {l(v, w), s(v, w)} and j ∈ {s(v, w), r(v, w)}. 
              Compute N := Nca_{C(i, j, h, v)}(X_i, Y_j). 

           ii. i = l(v, w) and w ⪯ j. 
              Compute N := Nca_{C(i, h, v, w)}(Right(1, X_i), w). 

           iii. j = r(v, w) and w ⪯ i. 
              Compute N := Nca_{C(j, h, v, w)}(w, Left(1, Y_j)). 

           Set R[h] := R[h] ∪ (N ∩ V(C(h))) and R[v] := R[v] ∪ (N ∩ V(C(v))). 

       Set X[i] := X[i] \ X_i and Y[j] := Y[j] \ Y_j. 

Return R. 



In procedure Nca we first find the next non-empty entries in the node arrays X[i] and Y[j]. Then we have 
two cases depending on whether i = j or not. If i = j (Case 1) we have two subcases. If i is a boundary 
node (Case 1(a)) then C(i) only consists of one node v = X[i] = Y[j] and therefore nca(v, v) = v = X[i]. If 
i is not a boundary node (Case 1(b)) we compare the sizes of the subsets represented by X[i] and Y[j]. 
Due to the assumption on the input (𝒳_i ≺ 𝒴_i) we either have |X[i]| > |Y[j]| or |X[i]| = |Y[j]|. If 
|X[i]| > |Y[j]| we must compute nearest common ancestors of the first (leftmost) |Y[j]| nodes in X[i] and the 
nodes in Y[j]. If |X[i]| = |Y[j]| we must compute nearest common ancestors of all nodes in X[i] and Y[j]. 
We now compute nearest common ancestors of the described nodes in a cluster S depending on what kind of 
node i is. If i is a leaf node then the nearest common ancestors of the nodes in X[i] and Y[j] are either in i or 
in the boundary node (Proposition 4(i)). If i is a left or right node then the nearest common ancestors must 
be in i, on the spine, or in the top boundary node (Proposition 4(ii)). If i is a spine node then the nearest 
common ancestors must be on the spine or in the top boundary node (Proposition 4(iii)). We update the 
output node array, remove from X[i] the nodes we have just computed nearest common ancestors of, and 
increment j since we have now computed nearest common ancestors for all nodes in Y[j]. 

Now consider the case where i ≠ j (Case 2). First we compare the sizes of the subsets represented by X[i] and Y[j]. 
If |X[i]| > |Y[j]| we should compute nearest common ancestors of the first (leftmost) |Y[j]| nodes in X[i] and 
the nodes in Y[j] as in Case 1(b). If |X[i]| < |Y[j]| we must compute nearest common ancestors of the 
first (leftmost) |X[i]| nodes in Y[j] and the nodes in X[i]. Otherwise |X[i]| = |Y[j]| and we compute nearest 
common ancestors of all nodes in X[i] and Y[j]. We now compute the nearest common ancestor of i and 
j in the macro tree. This must either be a boundary node or a spine node due to the structure of the macro 
tree. If it is a boundary node then the nearest common ancestor of all nodes in i and j is this boundary node. 
If it is a spine node we have three different cases depending on the types of i and j. If i is a left or spine node 
and j is a spine or right node in the same cluster then we compute nearest common ancestors in that cluster 
(Proposition 4(iv)). If i is a left node and j is a descendant of the bottom boundary node in i's cluster then 
we compute the nearest common ancestor of the rightmost node in X_i and w in i's cluster (Proposition 4(v)). 
That we can restrict the computation to only the rightmost node in X_i and w is due to the fact that we 
always run Deep on the output from Nca before using it in any other computations. In the last case j 
is a right node and i is a descendant of the bottom boundary node of j's cluster. Then we compute the 
nearest common ancestor of the leftmost node in Y_j and w (Proposition 4(vi)) in j's cluster. The argument 
for restricting the computation to the leftmost node of Y_j and w is the same as in the previous case. Due 
to the assumption on the input (𝒳_i ≺ 𝒴_i) the rest of the cases from Proposition 4(iv)-(vi) cannot happen. 
Therefore, we have now argued that the procedure correctly takes care of all cases from Proposition 4. 
Finally, we update the output node array and remove from X[i] and Y[j] the nodes we have just computed 
nearest common ancestors of. 

The correctness of the procedure follows from the above and induction on the rank of the elements. 
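
As a point of reference for what the node-array procedure must output, the following naive Python sketch computes the set of nearest common ancestors of the rank-wise pairs directly on T, without clusters. The tree representation (dictionaries `children` and `parent`) and the function names are assumptions made for illustration; the sketch runs in time proportional to the tree height per pair and is only meant as a specification to compare against.

```python
def preorder_numbers(children, root):
    """Preorder number of every node, visiting children left to right."""
    pre, stack = {}, [root]
    while stack:
        v = stack.pop()
        pre[v] = len(pre)
        stack.extend(reversed(children.get(v, [])))
    return pre


def nca(parent, a, b):
    """Nearest common ancestor of a and b by walking parent pointers."""
    seen = set()
    while a is not None:
        seen.add(a)
        a = parent[a]
    while b not in seen:
        b = parent[b]
    return b


def nca_pairs(children, parent, root, X, Y):
    """The set {nca(x_i, y_i)}, pairing the ith nodes of X and Y in preorder."""
    pre = preorder_numbers(children, root)
    xs = sorted(X, key=pre.get)
    ys = sorted(Y, key=pre.get)
    return {nca(parent, x, y) for x, y in zip(xs, ys)}


# Hypothetical tree: 0 has children 1 and 4, 1 has children 2 and 3, 4 has child 5.
children = {0: [1, 4], 1: [2, 3], 4: [5]}
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0, 5: 4}
result = nca_pairs(children, parent, 0, {2, 3}, {3, 5})
```

Here the pairs are (2, 3) and (3, 5), whose nearest common ancestors are 1 and 0.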

Deep(X): Initialize a node array R of size n_M and set j := 1. 
Repeat until i > n_M: 

    Set i := i + 1 until X[i] ≠ ∅. 

    Compare j and i. There are two cases: 

    1. j ⊲ i. Set 

           S := C(j, v)            if j = l(v), 
                C(j, s(v, w), v)   if j ∈ {l(v, w), r(v, w)}, 
                C(j)               otherwise. 

       Set R[j] := Deep_S(X[j]) and j := i. 

    2. j ≺ i. If i ∈ {l(v, w), r(v, w)} and j = s(v, w) compute N := Deep_{C(i, s(v,w), v)}(X[i] ∪ 
       X[j]), and set R[j] := R[j] ∩ N. 

       Set j := i. 

    Set i := i + 1. 

Set R[j] := Deep_S(X[j]), where S is set as in Case 1. 
Return R. 



The above Deep procedure resembles the previous Deep procedure implemented on the macro tree in the 
two first cases. The third case from the previous implementation can be omitted since the input list is now 
in preorder. In case 1 node i is to the right of our "potential output node" j. Since any node l that is a 
descendant of j must be to the left of i (l ⊲ i) it cannot appear later in the list X than i. We can thus 
safely add Deep_S(X[j]) to R at this point. To ensure that the cluster we compute Deep on is a tree we 
include the top boundary node if j is a leaf node and the top and spine node if j is a left or right node. In 
case 2 node j is an ancestor of i and can therefore not be in the output list unless j is a spine node and 
i is the corresponding left or right node. If this is the case we first compute Deep of X[j] in the cluster 
containing i and j and add the result to the output before setting i to be our new potential node. After 
scanning the whole node array X we add the last potential node j to the output after computing Deep of 
it as in case 1. 

That the procedure is correct follows by the proof of Lemma 15 and the above. 

We now give the implementation of procedure Mop. Procedure Mop takes a pair of node arrays (X, Y) 
and another node array Z as input. The pair (X, Y) represents a set of minimum ordered pairs, where the 
first coordinates are in X and the second coordinates are in Y. To simplify the implementation of procedure 
Mop it calls two auxiliary procedures MopSim and Match defined below. Procedure MopSim computes 
mop of Y and Z, and procedure Match takes care of finding the first coordinates from X corresponding to 
the first coordinates from the minimum ordered pairs from M. 

Mop((X, Y), Z): Compute M := MopSim(Y, Z). Compute R := Match(X, Y, M|_1). Return (R, M|_2). 

Procedure MopSim takes two node arrays as input and computes mop of these. 

MopSim(X, Y): Initialize two node arrays R and S of size n_M, set i := 1, j := 1, h := 1, (r_1, r_2) := (0, ∅), 
(s_1, s_2) := (0, ∅). Repeat the following until i > n_M or j > n_M: 

    Set i := i + 1 until X[i] ≠ ∅. There are four cases: 

    1. If i = l(v, w) for some v, w set j := j + 1 until Y[j] ≠ ∅ and either i ⊲ j, i = j, or 
       j = s(v, w). 

    2. If i = s(v, w) for some v, w set j := j + 1 until Y[j] ≠ ∅ and either i ⊲ j or 
       j = r(v, w). 

    3. If i ∈ {r(v, w), l(v)} for some v, w set j := j + 1 until Y[j] ≠ ∅ and either i ⊲ j or 
       i = j. 

    4. Otherwise (i is a boundary node) set j := j + 1 until Y[j] ≠ ∅ and i ⊲ j. 

    Compare i and j. There are two cases: 

    1. i ⊲ j: Compare s_1 and j. If s_1 ⊲ j set R[r_1] := R[r_1] ∪ r_2, S[s_1] := S[s_1] ∪ s_2, and 
       (s_1, s_2) := (j, Left_{C(j)}(1, Y[j])). 

       Set (r_1, r_2) := (i, Right_{C(i)}(1, X[i])) and i := i + 1. 

    2. Otherwise compute (r, s) := Mop_{C(i, j, v)}(X[i], Y[j]), where v is the top boundary 
       node in the cluster i and j belong to. 

       If r ≠ ∅ do: 

       - Compare s_1 and j. If s_1 ⊲ j or if s_1 = j and Leftof_{C(i, j)}(X[i], s_2) = ∅ then 
         set R[r_1] := R[r_1] ∪ r_2, S[s_1] := S[s_1] ∪ s_2. 

       - Set (r_1, r_2) := (i, r) and (s_1, s_2) := (j, s). 

       There are two subcases: 

       (a) i = j or i = l(v, w) and j = s(v, w). Set X[i] := Right_{C(i)}(X[i]) \ r_2 and 
           j := j + 1. 

       (b) i = s(v, w) and j = r(v, w). If r_2 = ∅ set j := j + 1 otherwise set i := i + 1. 

Set R[r_1] := R[r_1] ∪ r_2 and S[s_1] := S[s_1] ∪ s_2. Return (R, S). 

Procedure MopSim is somewhat similar to the previous implementation of the procedure Mop from 
Section 3.4.2. We again have a "potential pair" ((r_1, r_2), (s_1, s_2)) but we need more cases to take care of the 
different kinds of macro nodes. 

We first find the next non-empty macro node i. We then have four cases depending on which kind of node 
i is. In Case 1 i is a left node. Due to Proposition 3 we can have mop in i (case (i)), in the spine (case (iii)), 
or in a node to the left of i (case (v)). In Case 2 i is a spine node. Due to Proposition 3 we can have mop in 
the right node (case (iv)) or in a node to the left of i (case (v)). In Case 3 i is a right node or a leaf node. 
Due to Proposition 3 we can have mop in i (cases (i) and (ii)) or in a node to the left of i (case (v)). In the 
last case (Case 4) i must be a boundary node and mop must be in a node to the left of i. 

We then compare i and j. The case where i ◁ j is similar to the previous implementation of the procedure. 
We compare j with our potential pair. If s_1 ◁ j then we can insert r_2 and s_2 into our output node arrays 
R and S, respectively. We also set s_1 to j and s_2 to the leftmost node in Y[j]. Then, whether s_1 ◁ j or 
s_1 = j, we set r_1 to i and r_2 to the rightmost node in X[i]. We have thus updated ((r_1, r_2), (s_1, s_2)) to be 
our new potential pair. That we only need the rightmost node in X[i] and the leftmost node in Y[j] follows 
directly from the definition of mop. 

Case 2 (i is not to the left of j) is more complicated. In this case we need to compute mop in the cluster i and j 
belong to. If this results in any minimum ordered pairs (r ≠ ∅) we must update our potential pair. As in the 
previous case we compare s_1 and j, but this time we must also add r_2 and s_2 to the output if s_1 = j and no 
nodes in X[i] are to the left of the leftmost node in s_2. To see this first note that since r_1 ◁ i (the input is 
deep) we must have r_1 ≠ s_1 and thus s_2 contains only one node s'. If s' is to the left of all nodes in X[i] then 
no node in X[i] can be in a minimum ordered pair with s' and we can safely add our potential pair to the 
output. We then update our potential pair. Finally, we need to update X[i], i, and j. This update depends 
on which kind of macro nodes we have been working on. In Case (a) we either have i = j or i is a left node 
and j is a spine node. In both cases we can have nodes in X[i] that are not to the left of any node in 
Y[j]. The rightmost of these nodes can be in a minimum ordered pair with a node from another macro node 
and we thus update X[i] to contain this node only (if it exists). Now all nodes in Y[j] must be to the left 
of all nodes in X[i] in the next iteration and thus we increment j. In Case (b) i is a spine node and j is a 
right node. If r_2 = ∅ then no node in Y[j] is to the right of the node in X[i]. Since the input arrays are 
deep, no node later in the array X can be to the left of any node in Y[j] and we therefore increment j. If 
r_2 ≠ ∅ then the single node in X[i] is in the potential pair and we increment i. We do not increment j as 
there could be nodes in X[j] to the left of the nodes in Y[j]. When reaching the end of one of the arrays we 
add our potential pair to the output and return. 

The correctness of the procedure follows from the proof of Lemma 16 and the above. 

Procedure Match takes three node arrays X, Y, and Y' representing deep sets 𝒳, 𝒴, and 𝒴', where 
|𝒳| = |𝒴| and 𝒴' ⊆ 𝒴. The output is a node array representing the set {x_j | y_j ∈ 𝒴'}. 

Match(X, Y, Y'): Initialize a node array R of size n_M, set X_L := ∅, Y_L := ∅, Y'_L := ∅, x := 0, y := 0, i := 1 
and j := 1. 

Repeat until i > n_M or j > n_M: 

Until X[i] ≠ ∅ set i := i + 1. Set x := |X[i]|. 
Until Y[j] ≠ ∅ set j := j + 1. Set y := |Y[j]|. 
Compare Y[j] and Y'[j]. There are two cases: 

1. Y[j] = Y'[j]. Compare x and y. There are three cases: 

(a) x = y. Set R[i] := R[i] ∪ X[i], i := i + 1, and j := j + 1. 

(b) x < y. Set R[i] := R[i] ∪ X[i], Y[j] := Y[j] \ left(x, Y[j]), Y'[j] := Y[j], 
and i := i + 1. 

(c) x > y. Set X_L := left(y, X[i]), R[i] := R[i] ∪ X_L, X[i] := X[i] \ X_L, and 
j := j + 1. 

2. Y[j] ≠ Y'[j]. Compare x and y. There are three cases: 

(a) x = y. Set R[i] := R[i] ∪ match(X[i], Y[j], Y'[j]), i := i + 1, and j := j + 1. 

(b) x < y. Set Y_L := left(x, Y[j]), Y'_L := Y'[j] ∩ Y_L, 
R[i] := R[i] ∪ match(X[i], Y_L, Y'_L), Y[j] := Y[j] \ Y_L, Y'[j] := Y'[j] \ Y'_L, and 
i := i + 1. 

(c) x > y. Set X_L := left(y, X[i]), R[i] := R[i] ∪ match(X_L, Y[j], Y'[j]), 
X[i] := X[i] \ X_L, and j := j + 1. 

Return R. 

Procedure Match proceeds as follows. First we find the next non-empty entries in the two node arrays X[i] 
and Y[j]. We then compare Y[j] and Y'[j]. 

If they are equal we keep all nodes in X with the same rank as the nodes in Y[j]. We do this by splitting 
into three cases. If there are the same number of nodes in X[i] and Y[j] we add all nodes in X[i] to the output 
and increment i and j. If there are more nodes in Y[j] than in X[i] we add all nodes in X[i] to the output 
and update Y[j] to contain only the y - x rightmost nodes in Y[j]. We then increment i and iterate. If there 
are more nodes in X[i] than in Y[j] we add the first y nodes in X[i] to the output, increment j, and update 
X[i] to contain only the nodes we did not add to the output. 

If Y[j] ≠ Y'[j] we call the cluster procedure match. Again we split into three cases depending on the 
number of nodes in X[i] and Y[j]. If they have the same number of nodes we can just call match on X[i], 
Y[j], and Y'[j] and increment i and j. If |Y[j]| > |X[i]| we call match with X[i], the leftmost |X[i]| nodes of 
Y[j], and the part of Y'[j] that is a subset of these leftmost |X[i]| nodes of Y[j]. We then update Y[j] 
and Y'[j] to contain only the nodes we did not use in the call to match and increment i. If |Y[j]| < |X[i]| we 
call match with the leftmost |Y[j]| nodes of X[i], Y[j], and Y'[j]. We then update X[i] to contain only the 
nodes we did not use in the call to match and increment j. 

It follows by induction on the rank of the elements that the procedure is correct. 

Fl(X, α): Initialize a node array R of size n_M and two node lists L and S. 
Repeat until i > n_M: 

Until X[i] ≠ ∅ set i := i + 1. 

There are three cases depending on the type of i: 

1. i ∈ {l(v, w), r(v, w)}. Compute N := Fl_{C(i,s(v,w),v)}(X[i], α). 
If N ≠ ∅, for each j ∈ {i, s(v, w), v} set R[j] := R[j] ∪ (N ∩ V(C(j))). 
Otherwise, set L := L ∘ parent_M(v). 

2. i = l(v). Compute N := Fl_{C(i,v)}(X[i], α). 
If N ≠ ∅, for each j ∈ {i, v} set R[j] := R[j] ∪ (N ∩ V(C(j))). 
Otherwise, set L := L ∘ parent_M(v). 

3. i ∉ {l(v, w), r(v, w), l(v)}. Compute N := Fl_{C(i)}(X[i], α). 
If N ≠ ∅ set R[i] := R[i] ∪ N. 
Otherwise set L := L ∘ parent_M(i). 

Subsequently, compute the list S := Fl_M(L, α). For each node i ∈ S set R[i] := R[i] ∪ 
Fl_{C(i)}(first(S[i]), α). Return R. 

The Fl procedure is similar to Parent. Cases 1, 2, and 3 compute Fl on a micro forest. If the result 
is within the micro tree we add it to R and otherwise we store the node in the macro tree which contains 
the parent of the root of the micro forest in a node list L. Since we always call Deep on the output from 
Fl(X, α) there is no need to compute Fl in the macro tree if N is nonempty. We then compute Fl in the 
macro tree on the list L, store the results in a list S, and use this to compute the final result. 

Consider the cases of procedure Fl. In Case 1 i is a left or right node. Due to Proposition 2 cases (i) and 
(ii), fl of a node in i can be in i, on the spine, or in the top boundary node. If this is not the case it can be 
found by a computation of Fl of the parent of the top boundary node of i's cluster in the macro tree 
(Proposition 2 case (iii)). In Case 2 i is a leaf node. Then fl of a node in i must either be in i, in the top 
boundary node, or can be found by a computation of Fl of the parent of the top boundary node of i's 
cluster in the macro tree. If i is a spine node or a boundary node, fl of a node in i is either in i or can be 
found by a computation of Fl of the parent of i in the macro tree. 

The correctness of the procedure follows from Proposition 2, the above, and the correctness of procedure 
Fl_M. 

3.5.4 Complexity of the Tree Inclusion Algorithm 



To analyse the complexity of the node array implementation we first bound the running time of the above 
implementation of the set procedures. All procedures scan the input from left to right while gradually 
producing the output. In addition to this, procedure Fl needs a call to a node list implementation of Fl on 
the macro tree. Given the data structure described in Section 3.5.2 it is easy to check that each step in the 
scan can be performed in O(1) time, giving a total of O(n_T / log n_T) time. Since the number of nodes in the 
macro tree is O(n_T / log n_T) the call to the node list implementation of Fl is easily done within the same 
time. Hence, we have the following lemma. 

Lemma 22 For any tree T there is a data structure using O(n_T) space and O(n_T log n_T) preprocessing time 
which supports all of the set procedures in O(n_T / log n_T) time. 

Next consider computing the deep occurrences of P in T using the procedure Emb of Section 3.3 and 
Lemma 22. Since each node v ∈ V(P) contributes at most a constant number of calls to set procedures it 
follows immediately that, 

Theorem 8 For trees P and T the tree inclusion problem can be solved in O(n_P n_T / log n_T + n_T log n_T) 
time and O(n_T) space. 

Combining the results in Theorems 6, 8 and Corollary 1 we have the main result of Theorem 5. 
Abstract 



Given two rooted, labeled trees P and T the tree path subsequence problem is to determine which 
paths in P are subsequences of which paths in T. Here a path begins at the root and ends at a leaf. In 
this paper we propose this problem as a useful query primitive for XML data, and provide new algorithms 
improving the previously best known time and space bounds. 

4.1 Introduction 

We say that a tree is labeled if each node is assigned a character from an alphabet Σ. Given two sequences 
of labeled nodes p and t, we say that p is a subsequence of t, denoted p ⊑ t, if p can be obtained by removing 
nodes from t. Given two rooted, labeled trees P and T the tree path subsequence problem (TPS) is to 
determine which paths in P are subsequences of which paths in T. Here a path begins at the root and ends 
at a leaf. That is, for each path p in P we must report all paths t in T such that p ⊑ t. 

This problem was introduced by Chen [Che00] who gave an algorithm using O(min(l_P n_T + n_P, n_P l_T + n_T)) 
time and O(l_P d_T + n_P + n_T) space. Here, n_S, l_S, and d_S denote the number of nodes, number of leaves, 
and depth, respectively, of a tree S. Note that in the worst case this is quadratic time and space. In this 
paper we present improved algorithms giving the following result: 

Theorem 9 For trees P and T the tree path subsequence problem can be solved in O(n_P + n_T) space with 
the following running times: 

    O(min(l_P n_T + n_P,  n_P l_T + n_T,  n_P n_T / log n_T + n_T + n_P log n_P)). 

The first two bounds in Theorem 9 match the previous time bounds while improving the space to linear. 
This is achieved using an algorithm that resembles the algorithm of Chen [Che00]. At a high level, the 
algorithms are essentially identical and therefore the bounds should be regarded as an improved analysis 
of Chen's algorithm. The latter bound is obtained by using an entirely new algorithm that improves the 
worst-case quadratic time. Specifically, whenever log n_P = O(n_T / log n_T) the running time is improved by 
a logarithmic factor. Note that, in the worst case, the number of pairs consisting of a path from P and a 
path from T is Ω(n_P n_T), and therefore we need at least as many bits to report the solution to TPS. Hence, on 
a RAM with logarithmic word size our worst-case bound is optimal. Most importantly, all our algorithms 
use linear space. For practical applications this will likely make it possible to solve TPS on large trees and 
improve running time since more of the computation can be kept in main memory. 
Figure 4.1: (a) The trie of queries 1, 2, 3, or the tree for query 4. (b) A fragment of a catalog of books. 



4.1.1 Applications 

We propose TPS as a useful query primitive for XML data. The key idea is that an XML document D 
may be viewed as a rooted, labeled tree. For example, suppose that we want to maintain a catalog of books 
for a bookstore. A fragment of a possible XML tree, denoted D, corresponding to the catalog is shown 
in Fig. 4.1(b). In addition to supporting full-text queries, such as find all documents containing the word 
"John" , we can also use the tree structure of the catalog to ask more specific queries, such as the following 
examples: 

1. Find all books written by John, 

2. find all books written by Paul, 

3. find all books with a chapter that has something to do with XML, or 

4. find all books written by John and Paul with a chapter that has something to do with XML. 

The queries 1, 2, and 3 each correspond to a path query on D, that is, computing which paths in D contain 
a specific path as a subsequence. For instance, computing the paths in D that contain the path of three 
nodes labeled "book", "chapter", and "XML", respectively, effectively answers query 3. Most XML-query 
languages, such as XPath [CD99], support such queries. 

Using a depth-first traversal of D a path query can be solved in linear time. More precisely, if q is a 
path consisting of n_q nodes, answering the path query on D takes O(n_q + n_D) time. Hence, if we are given 
path queries q_1, ..., q_k we can answer them in O(n_{q_1} + ... + n_{q_k} + k n_D) time. However, we can do better by 
constructing the trie, Q, of q_1, ..., q_k. Answering all path queries now corresponds to solving TPS on Q and 
D. As an example the queries 1, 2, and 3 form the trie shown in Fig. 4.1(a). As l_Q ≤ k, Theorem 9 gives us 
an algorithm with running time 

    O(n_{q_1} + ... + n_{q_k} + min(k n_D + n_Q,  n_Q l_D + n_D,  n_Q n_D / log n_D + n_D + n_Q log n_Q)).    (4.1) 



Since n_Q ≤ n_{q_1} + ... + n_{q_k} this is at least as good as answering the queries individually and better in many 
cases. If many paths share a prefix, e.g., queries 1 and 2 share "book" and "author", n_Q can be much 
smaller than n_{q_1} + ... + n_{q_k}. Using our solution to TPS we can efficiently take advantage of this situation 
since the latter two terms in (4.1) depend on n_Q and not on n_{q_1} + ... + n_{q_k}. 
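
To make the saving from shared prefixes concrete, the following Python sketch builds a trie from a few label paths and compares its node count n_Q with the total query length. The concrete label sequences are assumptions chosen to mirror queries 1, 2, and 3 (only "book", "author", "chapter", and "XML" are named in the text), and the nested-dictionary trie and the name build_trie are illustrative choices, not part of the original work.

```python
def build_trie(queries):
    """Insert each label path into a nested-dict trie.

    Returns (trie, node_count), where node_count is the number of
    trie nodes created (the quantity called n_Q in the text).
    """
    trie = {}
    node_count = 0
    for query in queries:
        node = trie
        for label in query:
            if label not in node:
                node[label] = {}
                node_count += 1
            node = node[label]
    return trie, node_count


# Hypothetical label paths standing in for queries 1, 2, and 3.
queries = [
    ["book", "author", "John"],
    ["book", "author", "Paul"],
    ["book", "chapter", "XML"],
]
trie, n_Q = build_trie(queries)
total_length = sum(len(q) for q in queries)
print(n_Q, total_length)
```

Here the shared prefixes "book" and "book, author" are stored once, so the trie has fewer nodes than the sum of the query lengths.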

Next consider query 4. This query cannot be answered by solving a TPS problem but is an instance of 
the tree inclusion problem (TI). Here we want to decide if P is included in T, that is, if P can be obtained 
from T by deleting nodes of T. Deleting a node y in T means making the children of y children of the parent 
of y and then removing y. It is straightforward to check that we can answer query 4 by deciding if the tree 
in Fig. 4.1(a) can be included in the tree in Fig. 4.1(b). 
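
The node deletion operation described above can be sketched in a few lines of Python. The Node class and the helper names are assumptions made for illustration, and the sketch places the children of the deleted node at the position the node occupied, which is one natural reading for ordered trees.

```python
class Node:
    """A labeled tree node with an ordered list of children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def delete_child(parent, index):
    """Delete the child at `index`: its children become children of `parent`."""
    y = parent.children[index]
    parent.children[index:index + 1] = y.children
    return y


def to_tuple(node):
    """Nested-tuple view of a tree, convenient for comparisons."""
    return (node.label, [to_tuple(c) for c in node.children])


# Hypothetical catalog fragment; deleting "author" lifts "John" up to "book".
T = Node("book", [Node("author", [Node("John")]), Node("chapter", [Node("XML")])])
delete_child(T, 0)
print(to_tuple(T))
```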



*This work was performed while the author was a PhD student at the IT University of Copenhagen. 
Recently, TI has been recognized as an important XML query primitive and has received considerable 
attention, see e.g., [SM02, YLH03, YLH04, ZADR03, SN00, TRS02]. Unfortunately, TI is NP-complete in 
general [KM95a] and therefore the existing algorithms are based on heuristics. Observe that a necessary 
condition for P to be included in T is that all paths in P are subsequences of paths in T. Hence, we can use 
TPS to quickly identify trees or parts of trees that cannot be included in T. We believe that in this way TPS 
can be used as an effective "filter" for many tree inclusion problems that occur in practice. 



4.1.2 Technical Overview 

Given two strings (or labeled paths) a and b, it is straightforward to determine if a is a subsequence of b by 
scanning the characters of b from left to right. This uses O(|a| + |b|) time. We can solve TPS by applying 
this algorithm to each pair of paths in P and T, however, this may use as much as O(n_P n_T (n_P + n_T)) 
time. Alternatively, Baeza-Yates [BY91] showed how to preprocess b in O(|b| log |b|) time such that testing 
whether a is a subsequence of b can be done in O(|a| log |b|) time. Using this data structure on each path in 
T we can solve the TPS problem, however, this may take as much as O(n_T^2 log n_T + n_P^2 log n_T). Hence, none 
of the available subsequence algorithms on strings provide an immediate efficient solution to TPS. 
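
The straightforward approach mentioned first can be sketched as follows. This is a brute-force baseline for illustration only: trees are assumed to be given as (label, children) tuples, and the function names are hypothetical. It applies the linear-time left-to-right scan to every pair of root-to-leaf paths.

```python
def is_subsequence(a, b):
    """Scan b from left to right, matching the labels of a in order."""
    it = iter(b)
    # `label in it` advances the iterator until it finds `label` (or runs out).
    return all(label in it for label in a)


def root_to_leaf_paths(tree):
    """List the label sequences of all root-to-leaf paths, left to right."""
    label, children = tree
    if not children:
        return [[label]]
    return [[label] + p for child in children for p in root_to_leaf_paths(child)]


def tps_brute_force(P, T):
    """Return all pairs (i, j) such that path i of P is a subsequence of path j of T."""
    p_paths = root_to_leaf_paths(P)
    t_paths = root_to_leaf_paths(T)
    return {
        (i, j)
        for i, p in enumerate(p_paths, 1)
        for j, t in enumerate(t_paths, 1)
        if is_subsequence(p, t)
    }


P = ("a", [("b", []), ("c", [])])
T = ("a", [("b", [("c", [])]), ("c", [])])
print(sorted(tps_brute_force(P, T)))
```

Because every pair of paths is scanned separately, shared prefixes of paths are re-examined many times, which is the source of the cubic worst case noted above.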

Inspired by the work of Chen [Che00] we take another approach. We provide a framework for solving 
TPS. The main idea is to traverse T while maintaining a subset of nodes in P, called the state. When 
reaching a leaf z in T the state represents the paths in P that are subsequences of the path from the root 
to z. At each step the state is updated using a simple procedure defined on subsets of nodes. The result of 
Theorem 9 is obtained by taking the best of two algorithms based on our framework: The first one uses a 
simple data structure to maintain the state. This leads to an algorithm using O(min(l_P n_T + n_P, n_P l_T + n_T)) 
time. At a high level this algorithm resembles the algorithm of Chen [Che00] and achieves the same running 
time. However, we improve the analysis of the algorithm and show a space bound of O(n_P + n_T). This should 
be compared to the worst-case quadratic space bound of O(l_P d_T + n_P + n_T) given by Chen [Che00]. Our 
second algorithm takes a different approach combining several techniques. Starting with a simple quadratic 
time and space algorithm, we show how to reduce the space to O(n_P log n_T) using a decomposition of T into 
disjoint paths. We then divide P into small subtrees of logarithmic size called micro trees. The micro trees 
are then preprocessed such that subsets of nodes in a micro tree can be maintained in constant time and 
space. Intuitively, this leads to a logarithmic improvement of the time and space bound. 



4.1.3 Notation and Definitions 

In this section we define the notation and definitions we will use throughout the paper. For a graph G we 
denote the set of nodes and edges by V(G) and E(G), respectively. Let T be a rooted tree. The root of T 
is denoted by root(T). The size of T, denoted by n_T, is |V(T)|. The depth of a node y ∈ V(T), depth(y), is 
the number of edges on the path from y to root(T) and the depth of T, denoted d_T, is the maximum depth 
of any node in T. The parent of y is denoted parent(y). A node with no children is a leaf and otherwise 
it is an internal node. The number of leaves in T is denoted l_T. Let T(y) denote the subtree of T rooted 
at a node y ∈ V(T). If z ∈ V(T(y)) then y is an ancestor of z and if z ∈ V(T(y))\{y} then y is a proper 
ancestor of z. If y is a (proper) ancestor of z then z is a (proper) descendant of y. We say that T is labeled 
if each node y is assigned a character, denoted label(y), from an alphabet Σ. The path from root(T) to y, 
consisting of the nodes root(T) = y_1, ..., y_k = y, is denoted path(y). Hence, we can formally state TPS as 
follows: Given two rooted trees P and T with leaves x_1, ..., x_r and y_1, ..., y_s, respectively, determine all 
pairs (i, j) such that path(x_i) ⊑ path(y_j). For simplicity we will assume that leaves in P and T are always 
numbered as above and we identify each of the paths by the number of the corresponding leaf. 

Throughout the paper we assume a unit-cost RAM model of computation with word size O(log n_T) and a 
standard instruction set including bitwise boolean operations, shifts, addition and multiplication. All space 
complexities refer to the number of words used by the algorithm. 
Figure 4.2: The letters inside the nodes are the labels, and the identifier of each node is written outside 
the node. Initially we have X = {root(P)}. Since label(root(P)) = a = label(root(T)) we replace root(P) 
with its children and get X_root(T) = {x_1, x_2}. Since label(1) = label(x_1) ≠ label(x_2) we get X_1 = {x_3, x_2}. 
Continuing this way we get X_2 = {⊥_1, x_2}, X_3 = {⊥_1, ⊥_2}, X_4 = {x_3, ⊥_2}, and X_5 = {x_3, ⊥_2}. The nodes 
3 and 5 are leaves of T and we thus report paths 1 and 2 after computing X_3 and path 2 after computing 
X_5. 



4.2 A Framework for solving TPS 

In this section we present a simple general algorithm for the tree path subsequence problem. The key 
ingredient in our algorithm is the following procedure. For any X ⊆ V(P) and y ∈ V(T) define: 

Down(X, y): Return the set Child({x ∈ X | label(x) = label(y)}) ∪ {x ∈ X | label(x) ≠ label(y)}. 

The notation Child(X) denotes the set of children of X. Hence, Down(X, y) is the set consisting of the nodes 
in X with a different label than y and the children of the nodes in X with the same label as y. We will now 
show how to solve TPS using this procedure. 

First assign a unique number in the range {1, ..., l_P} to each leaf in P. Then, for each i, 1 ≤ i ≤ l_P, add 
a pseudo-leaf ⊥_i as the single child of the ith leaf. All pseudo-leaves are assigned a special label not in Σ. The 
algorithm traverses T in a depth-first order and computes at each node y the set X_y. We call this set the 
state at y. Initially, the state consists of {root(P)}. For z ∈ child(y), the state X_z can be computed from 
the state X_y as 

    X_z = Down(X_y, z). 

If z is a leaf we report the number of each pseudo-leaf in X_z as the paths in P that are subsequences of 
path(z). See Figure 4.2 for an example. To show the correctness of this approach we need the following 
lemma. 

Lemma 23 For any node y ∈ V(T) the state X_y satisfies the following property: 

    x ∈ X_y ⇒ path(parent(x)) ⊑ path(y). 

Proof. By induction on the number of iterations of the procedure. Initially, X = {root(P)} satisfies the 
property since root(P) has no parent. Suppose that X_y is the current state and z ∈ child(y) is the next 
node in the depth-first traversal of T. By the induction hypothesis X_y satisfies the property, that is, for any 
x ∈ X_y, path(parent(x)) ⊑ path(y). Then, 

    X_z = Down(X_y, z) = Child({x ∈ X_y | label(x) = label(z)}) ∪ {x ∈ X_y | label(x) ≠ label(z)}. 

Let x be a node in X_y. There are two cases. If label(x) = label(z) then path(x) ⊑ path(z) since 
path(parent(x)) ⊑ path(y). Hence, for any child x' of x we have path(parent(x')) ⊑ path(z). On the 
other hand, if label(x) ≠ label(z) then x ∈ X_z. Since y = parent(z) we have path(y) ⊑ path(z), and hence 
path(parent(x)) ⊑ path(y) ⊑ path(z). □ 

By the above lemma all paths reported at a leaf z ∈ V(T) are subsequences of path(z). The following lemma 
shows that the paths reported at a leaf z ∈ V(T) are exactly the paths in P that are subsequences of path(z). 
Lemma 24 Let z be a leaf in T and let ⊥_i be a pseudo-leaf in P. Then, 

    ⊥_i ∈ X_z ⇔ path(parent(⊥_i)) ⊑ path(z). 

Proof. It follows immediately from Lemma 23 that ⊥_i ∈ X_z ⇒ path(parent(⊥_i)) ⊑ path(z). It remains 
to show that path(parent(⊥_i)) ⊑ path(z) ⇒ ⊥_i ∈ X_z. Let path(z) = z_1, ..., z_k, where z_1 = root(T) 
and z_k = z, and let path(parent(⊥_i)) = y_1, ..., y_ℓ, where y_1 = root(P) and y_ℓ = parent(⊥_i). Since 
path(parent(⊥_i)) ⊑ path(z) there are nodes z_{j_i} with label(z_{j_i}) = label(y_i) for 1 ≤ i ≤ ℓ, such that 
(i) j_i < j_{i+1} and (ii) there exists no node z_j with label(z_j) = label(y_i), where j_{i-1} < j < j_i. Initially, 
X = {root(P)}. We have root(P) ∈ X_{z_j} for all j < j_1, since z_{j_1} is the first node on path(z) with label 
label(root(P)). When we get to z_{j_1}, root(P) is removed from the state and y_2 is inserted. Similarly, y_i is 
in all states X_{z_j} for j_{i-1} ≤ j < j_i. It follows that ⊥_i is in all states X_{z_j} where j ≥ j_ℓ and thus 
⊥_i ∈ X_{z_k} = X_z. □ 

The next lemma can be used to give an upper bound on the number of nodes in a state. 

Lemma 25 For any y ∈ V(T) the state X_y has the following property: Let x ∈ X_y. Then no ancestor of x 
is in X_y. 

Proof. By induction on the length of path(y). Initially, the state only contains root(P). Let z be the 
parent of y, and thus X_y is computed from X_z. First we note that for all nodes x ∈ X_y either x ∈ X_z 
or parent(x) ∈ X_z. If x ∈ X_z it follows from the induction hypothesis that no ancestor of x is in X_z, and 
thus no ancestors of x can be in X_y. If parent(x) ∈ X_z then due to the definition of Down we must have 
label(parent(x)) = label(y). It follows from the definition of Down that parent(x) ∉ X_y. □ 

It follows from Lemma 25 that |X_y| ≤ l_P for any y ∈ V(T). If we store the state in an unordered linked list 
each step of the depth-first traversal takes time O(l_P), giving a total O(l_P n_T + n_P) time algorithm. Since 
each state is of size at most l_P the space used is O(n_P + l_P n_T). In the following sections we show how to 
improve these bounds. 
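
The framework of this section can be sketched directly in Python. This is a minimal illustration under stated assumptions, not an optimized implementation: the Node class, the BOTTOM sentinel used as the special pseudo-leaf label, and the function names are all hypothetical, states are plain lists, and a separate state is kept for every node on the current recursion path.

```python
class Node:
    """A labeled tree node with an ordered list of children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


BOTTOM = object()  # stands in for the special label outside the alphabet


def add_pseudo_leaves(P):
    """Attach pseudo-leaf number i below the i-th leaf of P (mutates P)."""
    leaves = []

    def collect(x):
        if not x.children:
            leaves.append(x)
        for c in x.children:
            collect(c)

    collect(P)
    number = {}
    for i, leaf in enumerate(leaves, 1):
        bottom = Node(BOTTOM)
        leaf.children.append(bottom)
        number[bottom] = i
    return number


def down(X, y):
    """Down(X, y): replace nodes labeled like y by their children, keep the rest."""
    result = []
    for x in X:
        if x.label == y.label:
            result.extend(x.children)
        else:
            result.append(x)
    return result


def tps(P, T):
    """Map each leaf number of T to the sorted numbers of the paths of P it contains."""
    number = add_pseudo_leaves(P)
    reported = {}
    leaf_count = 0

    def visit(z, X_parent):
        nonlocal leaf_count
        X_z = down(X_parent, z)
        if not z.children:
            leaf_count += 1
            reported[leaf_count] = sorted(number[x] for x in X_z if x in number)
        for child in z.children:
            visit(child, X_z)

    visit(T, [P])  # the initial state is {root(P)}
    return reported


P = Node("a", [Node("b"), Node("c")])
T = Node("a", [Node("b", [Node("c")]), Node("c")])
print(tps(P, T))
```

In this small example the first leaf of T lies on the path a, b, c, which contains both paths of P, while the second leaf lies on the path a, c, which contains only the second path of P.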

4.3 A Simple Algorithm 

In this section we consider a simple implementation of the above algorithm, which has running time 
O(min(l_P n_T + n_P, n_P l_T + n_T)) and uses O(n_P + n_T) space. We assume that the size of the alphabet 
is n_P + n_T and each character in Σ is represented by an integer in the range {1, ..., n_P + n_T}. If this is not 
the case we can sort all characters in V(P) ∪ V(T) and replace each label by its rank in the sorted order. 
This does not change the solution to the problem, and assuming at least a logarithmic number of leaves in 
both trees it does not affect the running time. To get the space usage down to linear we will avoid saving 
all states. For this purpose we introduce the procedure Up, which reconstructs the state X_z from the state 
X_y, where z = parent(y). We can thus save space as we only need to save the current state. 

We use the following data structure to represent the current state X_y: A node dictionary consists of two 
dictionaries denoted X^c and X^p. The dictionary X^c represents the node set corresponding to X_y, and the 
dictionary X^p represents the node set corresponding to the set {x ∈ X_z | x ∉ X_y and z is an ancestor of y}. 
That is, X^c represents the nodes in the current state, and X^p represents the nodes that are in a state X_z, 
where z is an ancestor of y in T, but not in X_y. We will use X^p to reconstruct previous states. The 
dictionary X^c is indexed by Σ and X^p is indexed by V(T). The subsets stored at each entry are represented 
by doubly-linked lists. Furthermore, each node in X^c maintains a pointer to its parent in X^p and each node 
x' in X^p stores a linked list of pointers to its children in X^c. With this representation the total size of the 
node dictionary is O(n_P + n_T). 

Next we show how to solve the tree path subsequence problem in our framework using the node dictionary 
representation. For simplicity, we add a node ⊤ to P as the parent of root(P). Initially, X^p represents 
⊤ and X^c represents root(P). The Down and Up procedures are implemented as follows: 

Down((X^p, X^c), y): 

1. Set X := X^c[label(y)] and X^c[label(y)] := ∅. 

2. For each x ∈ X do: 

(a) Set X^p[y] := X^p[y] ∪ {x}. 

(b) For each x' ∈ child(x) do: 

i. Set X^c[label(x')] := X^c[label(x')] ∪ {x'}. 

ii. Create pointers between x' and x. 

3. Return (X^p, X^c). 

Up((X^p, X^c), y): 

1. Set X := X^p[y] and X^p[y] := ∅. 

2. For each x ∈ X do: 

(a) Set X^c[label(x)] := X^c[label(x)] ∪ {x}. 

(b) For each x' ∈ child(x) do: 

i. Remove pointers between x' and x. 

ii. Set X^c[label(x')] := X^c[label(x')] \ {x'}. 

3. Return (X^p, X^c). 

The next lemma shows that Up correctly reconstructs the former state. 

Lemma 26 Let X_z = (X^c, X^p) be a state computed at a node z ∈ V(T), and let y be a child of z. Then, 

    X_z = Up(Down(X_z, y), y). 

Proof. Let (X_1^c, X_1^p) = Down(X_z, y) and (X_2^c, X_2^p) = Up((X_1^c, X_1^p), y). We will first show that x ∈ X_z ⇒ 
x ∈ Up(Down(X_z, y), y). 

Let x be a node in X^c. There are two cases. If x ∈ X^c[label(y)], then it follows from the implementation 
of Down that x ∈ X_1^p[y]. By the implementation of Up, x ∈ X_1^p[y] implies x ∈ X_2^c. If x ∉ X^c[label(y)] 
then x ∈ X_1^c. We need to show parent(x) ∉ X_1^p[y]. This will imply x ∈ X_2^c, since the only nodes removed 
from X_1^c when computing X_2^c are the nodes with a parent in X_1^p[y]. Since y is unique it follows from the 
implementation of Down that parent(x) ∈ X_1^p[y] implies x ∈ X^c[label(y)]. 

Let x be a node in X^p. Since y is unique we have x ∈ X^p[y'] for some y' ≠ y. It follows immediately 
from the implementation of Up and Down that X^p[y'] = X_1^p[y'] = X_2^p[y'], when y' ≠ y, and thus X^p = X_2^p. 

We will now show x ∈ Up(Down(X_z, y), y) ⇒ x ∈ X_z. Let x be a node in X_2^c. There are two cases. If 
x ∉ X_1^c then it follows from the implementation of Up that x ∈ X_1^p[y]. By the implementation of Down, 
x ∈ X_1^p[y] implies x ∈ X^c[label(y)], i.e., x ∈ X^c. If x ∈ X_1^c then by the implementation of Up, x ∈ X_2^c 
implies parent(x) ∉ X_1^p[y]. It follows from the implementation of Down that x ∈ X^c. Finally, let x be a 
node in X_2^p. As argued above X^p = X_2^p, and thus x ∈ X^p. □ 

From the current state X_y = (X^c, X^p) the next state X_z is computed as follows: 

    X_z = Down(X_y, z)   if y = parent(z), 
    X_z = Up(X_y, y)     if z = parent(y). 



The correctness of the algorithm follows from Lemma 24 and Lemma 26. We will now analyze the running 
time of the algorithm. The procedures Down and Up use time linear in the size of the current state and 
the state computed. By Lemma 25 the size of each state is O(l_P). Each step in the depth-first traversal thus 
takes time O(l_P), which gives a total running time of O(l_P n_T + n_P). On the other hand consider a path t in 
T. We will argue that the computation of all the states along the path takes total time O(n_P + n_t), where 
n_t is the number of nodes in t. To show this we need the following lemma. 

Lemma 27 Let t be a path in T. During the computation of the states along the path t, any node x ∈ V(P) 
is inserted into X^c at most once. 

Proof. Since t is a path we only need to consider the Down computations. The only way a node x ∈ V(P) 
can be inserted into X^c is if parent(x) ∈ X^c. It thus follows from Lemma 25 that x can be inserted into X^c 
at most once. □ 



It follows from Lemma 27 that the computation of all the states when T is a path takes time O(n_P + n_T). 
Consider a path-decomposition of T, that is, a decomposition of T into disjoint paths. 
We can make such a path-decomposition of the tree T consisting of l_T paths. Since the running times of Up 
and Down are both linear in the size of the current and computed state it follows from Lemma 26 that 
we only need to consider the total cost of the Down computations on the paths in the path-decomposition. 
Thus, summing over the paths t in the decomposition, the algorithm uses time at most Σ_t O(n_P + n_t) = O(n_P l_T + n_T). 

Next we consider the space used by the algorithm. Lemma 25 implies that $|X^c| \le l_P$. Now consider the
size of $X^p$. A node is inserted into $X^p$ when it is removed from $X^c$. It is removed again when inserted into
$X^c$ again. Thus Lemma 27 implies $|X^p| \le n_P$ at any time. The total space usage is thus $O(n_P + n_T)$. To
summarize we have shown,

Theorem 10 For trees $P$ and $T$ the tree path subsequence problem can be solved in $O(n_P + n_T)$ space and
$O(\min(l_P n_T + n_P, n_P l_T + n_T))$ time.



4.4 A Worst-Case Efficient Algorithm 



In this section we consider the worst-case complexity of TPS and present an algorithm using subquadratic 
running time and linear space. The new algorithm works within our framework but does not use the Up 
procedure or the node dictionaries from the previous section. 

Recall that using a simple linked list to represent the states we immediately get an algorithm using
$O(n_P n_T)$ time and space. We first show how to modify the traversal of $T$ and discard states along the
way such that at most $O(\log n_T)$ states are stored at any step in the traversal. This improves the space to
$O(n_P \log n_T)$. Secondly, we decompose $P$ into small subtrees, called micro trees, of size $O(\log n_T)$. Each
micro tree can be represented in a single word of memory and therefore a state uses only $O(\lceil n_P / \log n_T \rceil)$
space. In total the space used to represent the $O(\log n_T)$ states is $O(\lceil n_P / \log n_T \rceil \cdot \log n_T) = O(n_P + \log n_T)$. Finally,
we show how to preprocess $P$ in linear time and space such that computing the new state can be done in
constant time per micro tree. Intuitively, this achieves the $O(\log n_T)$ speedup.



4.4.1 Heavy Path Traversal 

In this section we present the modified traversal of $T$. We first partition $T$ into disjoint paths as follows. For
each node $y \in V(T)$ let $\mathrm{size}(y) = |V(T(y))|$. We classify each node as either heavy or light as follows. The
root is light. For each internal node $y$ we pick a child $z$ of $y$ of maximum size among the children of $y$ and
classify $z$ as heavy. The remaining children are light. An edge to a light child is a light edge, and an edge to
a heavy child is a heavy edge. The heavy child of a node $y$ is denoted $\mathrm{heavy}(y)$. Let $\mathrm{ldepth}(y)$ denote the
number of light edges on the path from $y$ to $\mathrm{root}(T)$.

Lemma 28 (Harel and Tarjan [HT84]) For any tree $T$ and node $y \in V(T)$, $\mathrm{ldepth}(y) \le \log n_T + O(1)$.

Removing the light edges, $T$ is partitioned into heavy paths. We traverse $T$ according to the heavy paths
using the following procedure. For node $y \in V(T)$ define:

Visit($y$):
1. If $y$ is a leaf report all leaves in $X_y$ and return.
2. Else let $y_1, \ldots, y_k$ be the light children of $y$ and let $z = \mathrm{heavy}(y)$.
3. For $i := 1$ to $k$ do:
   (a) Compute $X_{y_i}$ := Down($X_y$, $y_i$).
   (b) Compute Visit($y_i$).
4. Compute $X_z$ := Down($X_y$, $z$).
5. Discard $X_y$ and compute Visit($z$).

The procedure is called on the root node of $T$ with the initial state $\{\mathrm{root}(P)\}$. The traversal resembles a
depth-first traversal; however, at each step the light children are visited before the heavy child. We therefore
call this a heavy path traversal. Furthermore, once the state of the heavy child (and therefore the states of all
children) has been computed, we discard $X_y$. At any step the state $X_y$ is available before Visit($y$) is called, and
therefore the procedure is correct. We have the following property:

Lemma 29 For any tree $T$ the heavy path traversal stores at most $\log n_T + O(1)$ states.

Proof. At any node $y \in V(T)$ we store at most one state for each of the light nodes on the path from $y$ to
$\mathrm{root}(T)$. Hence, by Lemma 28 the result follows. □

Using the heavy-path traversal immediately gives an $O(n_P n_T)$ time and $O(n_P \log n_T)$ space algorithm. In
the following section we improve the time and space by an additional $O(\log n_T)$ factor.
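
The heavy path traversal can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: the tree $T$ is given as a dictionary `children`, the Down procedure is passed in as a callback `down`, and leaves are reported through a callback `report`. The function also counts how many states are alive at once, so the bound of Lemma 29 can be observed on examples.

```python
def heavy_path_traversal(children, root, initial_state, down, report):
    """Heavy path traversal of T; returns the peak number of stored states.

    children: dict node -> list of children (assumed representation of T)
    down(state, z): state of child z computed from the state of its parent
    report(y, state): callback invoked at every leaf y with its state X_y
    """
    # size[y] = number of nodes in the subtree T(y), computed bottom-up.
    order = [root]
    for y in order:                       # appending while iterating gives BFS order
        order.extend(children.get(y, []))
    size = {}
    for y in reversed(order):
        size[y] = 1 + sum(size[c] for c in children.get(y, []))

    live = 1    # states currently stored (initially only the root state)
    peak = 1

    def visit(y, state):
        nonlocal live, peak
        while True:
            kids = children.get(y, [])
            if not kids:                  # step 1: y is a leaf
                report(y, state)
                return
            heavy = max(kids, key=lambda c: size[c])   # step 2
            for c in kids:                # step 3: light children first
                if c != heavy:
                    live += 1
                    peak = max(peak, live)
                    visit(c, down(state, c))
                    live -= 1
            peak = max(peak, live + 1)    # step 4: X_y and X_z coexist briefly
            state = down(state, heavy)    # step 5: X_y is dropped here
            y = heavy                     # continue along the heavy path

    visit(root, initial_state)
    return peak
```

The heavy child is followed by a loop rather than a recursive call, so the recursion depth is bounded by the number of light edges on a root-to-leaf path, mirroring the space argument above.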

4.4.2 Micro Tree Decomposition 

In this section we present the decomposition of $P$ into small subtrees. A micro tree is a connected subgraph of
$P$. A set of micro trees $MS$ is a micro tree decomposition iff $V(P) = \bigcup_{M \in MS} V(M)$ and for any $M, M' \in MS$,
$(V(M) \setminus \{\mathrm{root}(M)\}) \cap (V(M') \setminus \{\mathrm{root}(M')\}) = \emptyset$. Hence, two micro trees in a decomposition share at most
one node and this node must be the root in at least one of the micro trees. If $\mathrm{root}(M') \in V(M)$ then $M$ is
the parent of $M'$ and $M'$ is the child of $M$. A micro tree with no children is a leaf and a micro tree with no
parent is a root. Note that we may have several root micro trees since they can overlap at the node $\mathrm{root}(P)$.
We decompose $P$ according to the following classic result:

Lemma 30 (Gabow and Tarjan [GT83]) For any tree $P$ and parameter $s > 1$, it is possible to build a
micro tree decomposition $MS$ of $P$ in linear time such that $|MS| = O(\lceil n_P/s \rceil)$ and $|V(M)| \le s$ for any
$M \in MS$.

4.4.3 Implementing the Algorithm 

In this section we show how to implement the Down procedure using the micro tree decomposition. First
decompose $P$ according to Lemma 30 for a parameter $s$ to be chosen later. Hence, each micro tree has at
most $s$ nodes and $|MS| = O(\lceil n_P/s \rceil)$. We represent the state $X$ compactly using a bit vector for each micro
tree. Specifically, for any micro tree $M$ we store a bit vector $X_M = [b_1, \ldots, b_s]$, such that $X_M[i] = 1$ iff the
$i$th node in a preorder traversal of $M$ is in $X$. If $M$ has fewer than $s$ nodes we leave the remaining values undefined. Later
we choose $s = \Theta(\log n_T)$ such that each bit vector can be represented in a single word.

Next we define a Down$_M$ procedure on each micro tree $M \in MS$. Due to the overlap between micro
trees the Down$_M$ procedure takes a bit $b$ which will be used to propagate information between micro trees.
For each micro tree $M \in MS$, bit vector $X_M$, bit $b$, and $y \in V(T)$ define:

Down$_M$($X_M$, $b$, $y$): Compute the state $X'_M := \mathrm{Child}(\{x \in X_M \mid \mathrm{label}(x) = \mathrm{label}(y)\}) \cup \{x \in X_M \mid \mathrm{label}(x) \ne \mathrm{label}(y)\}$. If $b = 0$, return $X'_M$, else return $X'_M \cup \{\mathrm{root}(M)\}$.

Later we will show how to implement Down$_M$ in constant time for $s = O(\log n_T)$. First we show how to
use Down$_M$ to simulate Down on $P$. We define a recursive procedure Down which traverses the hierarchy
of micro trees. For micro tree $M$, state $X$, bit $b$, and $y \in V(T)$ define:

Down($X$, $M$, $b$, $y$): Let $M_1, \ldots, M_k$ be the children of $M$.
1. Compute $X_M$ := Down$_M$($X_M$, $b$, $y$).
2. For $i := 1$ to $k$ do:
   (a) Compute Down($X$, $M_i$, $b_i$, $y$), where $b_i = 1$ iff $\mathrm{root}(M_i) \in X_M$.

Intuitively, the Down procedure works in a top-down fashion using the $b$ bit to propagate the new state of the
root of each micro tree. To solve the problem within our framework we initially construct the state representing
$\{\mathrm{root}(P)\}$. Then, at each step we call Down($R_j$, 0, $y$) on each root micro tree $R_j$. We formally show that
this is correct:

Lemma 31 The above algorithm correctly simulates the Down procedure on $P$.

Proof. Let $X$ be the state and let $X'$ := Down($X$, $y$). For simplicity, assume that there is only one root
micro tree $R$. Since the root micro trees can only overlap at $\mathrm{root}(P)$ it is straightforward to generalize the
result to any number of roots. We show that if $X$ is represented by bit vectors at each micro tree then calling
Down($R$, 0, $y$) correctly produces the new state $X'$.

If $R$ is the only micro tree then only line 1 is executed. Since $b = 0$ this produces the correct state by
definition of Down$_M$. Otherwise, consider a micro tree $M$ with children $M_1, \ldots, M_k$ and assume that $b = 1$
iff $\mathrm{root}(M) \in X'$. Line 1 computes and stores the new state returned by Down$_M$. If $b = 0$ the correctness
follows immediately. If $b = 1$ observe that Down$_M$ first computes the new state and then adds $\mathrm{root}(M)$.
Hence, in both cases the state of $M$ is correctly computed. Line 2 recursively computes the new state of the
children of $M$. □



If each micro tree has size at most $s$ and Down$_M$ can be computed in constant time it follows that
the above algorithm computes each new state in $O(\lceil n_P/s \rceil)$ time, and hence solves TPS with $O(\lceil n_P/s \rceil)$ time per node of $T$. In the following section we show how to do this for
$s = \Theta(\log n_T)$, while maintaining linear space.

4.4.4 Representing Micro Trees 

In this section we show how to preprocess all micro trees $M \in MS$ such that Down$_M$ can be computed
in constant time. This preprocessing may be viewed as a "Four Russian Technique" [ADKF70]. To achieve
this in linear space we need the following auxiliary procedures on micro trees. For each micro tree $M$, bit
vector $X_M$, and $a \in \Sigma$ define:

Child$_M$($X_M$): Return the bit vector of nodes in $M$ that are children of nodes in $X_M$.
Eq$_M$($a$): Return the bit vector of nodes in $M$ labeled $a$.

By definition it follows that:

$$\mathrm{Down}_M(X_M, b, y) = \begin{cases} \mathrm{Child}_M(X_M \cap \mathrm{Eq}_M(\mathrm{label}(y))) \cup (X_M \setminus (X_M \cap \mathrm{Eq}_M(\mathrm{label}(y)))) & \text{if } b = 0, \\ \mathrm{Child}_M(X_M \cap \mathrm{Eq}_M(\mathrm{label}(y))) \cup (X_M \setminus (X_M \cap \mathrm{Eq}_M(\mathrm{label}(y)))) \cup \{\mathrm{root}(M)\} & \text{if } b = 1. \end{cases}$$

Recall that the bit vectors are represented in a single word. Hence, given Child$_M$ and Eq$_M$ we can compute
Down$_M$ using standard bit-operations in constant time.
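
As a rough illustration of the identity above, the following Python sketch tabulates Child$_M$ for one micro tree and evaluates Down$_M$ with bit operations. The parent-array input, the function names, and the convention that bit $i$ stands for the $i$th node in preorder (so bit 0 is the root) are assumptions made for this example. The sketch builds one table per micro tree, whereas the text shares a table among all micro trees with the same topology.

```python
def build_micro_tree_tables(parent, labels):
    """Tabulate Child_M and Eq_M for one micro tree.

    parent[i] is the preorder index of the parent of node i (parent[0] = -1);
    labels[i] is the label of node i. Both are hypothetical inputs.
    """
    s = len(parent)
    child_of = [0] * s                    # child_of[i] = bit mask of the children of node i
    for i in range(1, s):
        child_of[parent[i]] |= 1 << i
    # Child_M for every one of the 2^s states of this topology.
    child_table = [0] * (1 << s)
    for x in range(1, 1 << s):
        low = x & -x                      # lowest set bit of x
        child_table[x] = child_table[x ^ low] | child_of[low.bit_length() - 1]
    # Eq_M: label -> bit mask of the nodes carrying that label.
    eq = {}
    for i, a in enumerate(labels):
        eq[a] = eq.get(a, 0) | (1 << i)
    return child_table, eq


def down_micro(child_table, eq, x, b, label):
    """Down_M(X_M, b, y) for label = label(y), using a constant number of word operations."""
    matched = x & eq.get(label, 0)        # X_M intersected with Eq_M(label(y))
    new_x = child_table[matched] | (x & ~matched)
    return (new_x | 1) if b else new_x    # bit 0 is root(M)
```

For a micro tree with root `a` whose children are labeled `b` and `a`, calling `down_micro` on the state containing only the root with label `a` replaces the root by its two children, which is the behaviour the definition of Down$_M$ asks for.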

Next we show how to efficiently implement the operations. For each micro tree $M \in MS$ we store the
value Eq$_M$($a$) in a hash table indexed by $a$. Since the total number of different characters in any $M \in MS$
is at most $s$, the hash table Eq$_M$ contains at most $s$ entries. Hence, the total number of entries in all
hash tables is $O(n_P)$. Using perfect hashing we can thus represent Eq$_M$ for all micro trees $M \in MS$ in
$O(n_P)$ space and $O(1)$ worst-case lookup time. The preprocessing time is expected $O(n_P)$ w.h.p. To get
a worst-case bound we use the deterministic dictionary of Hagerup et al. [HMP01] with $O(n_P \log n_P)$
worst-case preprocessing time.

Next consider implementing Child$_M$. Since this procedure is independent of the labeling of $M$ it suffices
to precompute it for all topologically different rooted trees of size at most $s$. The total number of such trees
is less than $2^{2s}$ and the number of different states in each tree is at most $2^s$. Therefore Child$_M$ has to be
computed for a total of $2^{2s} \cdot 2^s = 2^{3s}$ different inputs. For any given tree and any given state, the value of
Child$_M$ can be computed and encoded in $O(s)$ time. In total we can precompute all values of Child$_M$ in
$O(s 2^{3s})$ time. Choosing the largest $s$ such that $3s + \log s \le \log n_T$ (hence $s = \Theta(\log n_T)$) this uses $O(n_T)$ time
and space. Each of the inputs to Child$_M$ is encoded in a single word such that we can look them up in
constant time.
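
To see why such a choice of $s$ keeps the preprocessing linear in $n_T$, the bound can be written out as follows. The derivation assumes that $\log$ denotes the base-2 logarithm and that the condition on $s$ is $3s + \log s \le \log n_T$, which is the reading under which the stated $O(n_T)$ bound holds.

```latex
\[
  3s + \log s \le \log n_T
  \;\Longrightarrow\;
  s \cdot 2^{3s} = 2^{3s + \log s} \le 2^{\log n_T} = n_T .
\]
```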

Finally, note that we also need to report the leaves of a state efficiently since this is needed in line 1 of
the Visit procedure. To do this compute the state $L$ corresponding to all leaves in $P$. Clearly, the leaves of
a state $X$ can be computed by performing a bitwise AND of each pair of bit vectors in $L$ and $X$. Computing
$L$ uses $O(n_P)$ time and the bitwise AND operation uses $O(\lceil n_P/s \rceil)$ time.

Combining the results, we decompose $P$, for $s$ as described above, and compute all values of Eq$_M$ and
Child$_M$. Then, we solve TPS using the heavy-path traversal. Since $s = \Theta(\log n_T)$ and from Lemmas 29
and 30 we have the following theorem:

Theorem 11 For trees $P$ and $T$ the tree path subsequence problem can be solved in $O(n_P + n_T)$ space and
$O(\frac{n_P n_T}{\log n_T} + n_T + n_P \log n_P)$ time.

Combining the results of Theorems 10 and 11 this proves Theorem 9. 
Abstract

The use of word operations has led to fast algorithms for classic problems such as shortest paths
and sorting. Many classic problems in stringology, notably regular expression matching and its variants,
as well as edit distance computation, also have transdichotomous algorithms. Some of these algorithms
have alphabet restrictions or require a large amount of space. In this paper, we improve on several of
the key results by providing algorithms that improve on known time/space bounds, or algorithms that
remove restrictions on the alphabet size.

5.1 Introduction 

Transdichotomous algorithms [FW93, FW94] allow logarithmic-sized words to be manipulated in constant
time. Many classic problems, such as MST [FW94], Shortest Paths [Tho99] and Sorting [HT02], have
fast transdichotomous algorithms. Many classic stringology problems also have transdichotomous solutions,
though some of these, such as Myers' algorithm for regular expression matching [Mye92a], use a lot of space,
whereas others, such as the algorithm by Masek and Paterson [MP80] for edit distance computation, require
that the alphabet be of constant size.

In this paper, we give improved algorithms for several such classic problems. In particular: 

Regular Expression Matching Given a regular expression $R$ and a string $Q$, the Regular Expression
Matching problem is to determine if $Q$ is a member of the language denoted by $R$. This problem occurs in
several text processing applications, such as in editors like Emacs [Sta81] or in the Grep utilities [WM92a,
Nav01b]. It is also used in the lexical analysis phase of compilers and interpreters, where regular expressions are
commonly used to match tokens for the syntax analysis phase, and more recently for querying and validating
XML databases, see e.g., [HP01, LM01, Mur01, BML+04]. The standard textbook solution to the problem,
due to Thompson [Tho68], constructs a non-deterministic finite automaton (NFA) for $R$ and simulates it on
the string $Q$. For $R$ and $Q$ of sizes $m$ and $n$, respectively, this algorithm uses $O(mn)$ time and $O(m)$ space. If
the NFA is converted into a deterministic finite automaton (DFA), the DFA needs $O(\frac{m}{w} 2^{2m} \sigma)$ words, where
$\sigma$ is the size of the alphabet $\Sigma$ and $w$ is the word size. Using clever representations of the DFA the space
can be reduced to $O(\frac{m}{w}(2^m + \sigma))$ [WM92b, NR04].

Normally, it is reported that the running time of traversing the DFA is $O(n)$, but this complexity analysis
ignores the word size. Since nodes in the DFA may need $\Omega(m)$ bits to be addressed, we may need $\Omega(m/w + 1)$
time to identify the next node in the traversal. Therefore the running time becomes $O(mn/w + n + m)$ with a
potential exponential blowup in the space. Hence, in the transdichotomous model, where $w$ is $O(\log(n + m))$,
using worst-case exponential preprocessing time improves the query time by a log factor. The fastest known
algorithm is due to Myers [Mye92a], who showed how to achieve $O(mn/k + m 2^k + (n + m)\log m)$ time
and $O(2^k m)$ space, for any $k \le w$. In particular, for $k = \log(n/\log n)$ this gives an algorithm using
$O(mn/\log n + (n + m)\log m)$ time and $O(mn/\log n)$ space.

In Section 5.2, we present an algorithm for Regular Expression Matching that takes $O(nm/k +
n + m\log m)$ time and uses $O(2^k + m)$ space, for any $k \le w$. In particular, if we pick $k = \log n$, we are (at
least) as fast as the algorithm of Myers, while achieving $O(n + m)$ space.
Approximate Regular Expression Matching Motivated by applications in computational biology,
Myers and Miller [MM89] studied the Approximate Regular Expression Matching problem. Here,
we want to determine if $Q$ is within edit distance $d$ to any string in the language given by $R$. The edit
distance between two strings is the minimum number of insertions, deletions, and substitutions needed
to transform one string into the other. Myers and Miller [MM89] gave an $O(mn)$ time and $O(m)$ space
dynamic programming algorithm. Subsequently, assuming a constant-sized alphabet, Wu, Manber and
Myers [WMM95] gave an $O(\frac{mn\log(d+2)}{\log n} + n + m)$ time and $O(\frac{m\sqrt{n}\log(d+2)}{\log n} + n + m)$ space algorithm. Recently,
an exponential space solution based on DFAs for the problem has been proposed by Navarro [Nav04].

In Section 5.3, we extend our results of Section 5.2 and give an algorithm, without any assumption on
the alphabet size, using $O(\frac{mn\log(d+2)}{k} + n + m\log m)$ time and $O(2^k + m)$ space, for any $k \le w$.

Subsequence Indexing We also consider a special case of regular expression matching. Given text $T$,
the Subsequence Indexing problem is to preprocess $T$ to allow queries of the form "is $Q$ a subsequence of
$T$?" Baeza-Yates [BY91] showed that this problem can be solved with $O(n)$ preprocessing time and space,
and query time $O(m\log n)$, where $Q$ has length $m$ and $T$ has length $n$. Conversely, one can achieve queries
of time $O(m)$ with $O(n\sigma)$ preprocessing time and space. As before, $\sigma$ is the size of the alphabet.
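
One simple way to obtain linear space with $O(m\log n)$ query time is to store, for every character, the sorted list of its positions in $T$ and to binary search for the next occurrence. The Python sketch below illustrates this idea; the function names are hypothetical, and the sketch is not claimed to be the exact structure of [BY91].

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(text):
    """Map each character to the increasing list of its positions in text."""
    positions = defaultdict(list)
    for i, c in enumerate(text):
        positions[c].append(i)
    return positions


def is_subsequence(index, query):
    """Decide whether query is a subsequence of the indexed text."""
    pos = 0                               # smallest text position still allowed
    for c in query:
        occ = index.get(c)
        if not occ:
            return False                  # character does not occur in the text
        j = bisect_left(occ, pos)         # first occurrence at position >= pos
        if j == len(occ):
            return False
        pos = occ[j] + 1
    return True
```

Each query character costs one binary search over at most $n$ positions, which gives the $O(m\log n)$ query bound, and the lists together contain exactly $n$ entries.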

In Section 5.4, we give an algorithm that improves the former result to $O(m\log\log\sigma)$ query time or the
latter result to $O(n\sigma^{\epsilon})$ preprocessing time and space.

String Edit Distance We conclude by giving a simple way to improve the complexity of the String
Edit Distance problem, which is defined as that of computing the minimum number of edit operations
needed to transform a given string $S$ of length $m$ into a given string $T$ of length $n$. The standard dynamic
programming solution to this problem uses $O(mn)$ time and $O(\min(m, n))$ space. The fastest algorithm for
this problem, due to Masek and Paterson [MP80], achieves $O(mn/k^2 + m + n)$ time and $O(2^k + \min(n, m))$
space for any $k \le w$. However, this algorithm assumes a constant size alphabet.
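
For reference, the standard dynamic programming solution mentioned above can be sketched as follows. This is the textbook recurrence with unit costs, keeping only one previous row so that the space is proportional to the shorter string; the function name is an assumption of this example.

```python
def edit_distance(s, t):
    """Unit-cost edit distance using O(min(len(s), len(t))) working space."""
    if len(s) < len(t):
        s, t = t, s                       # keep the shorter string along the row
    prev = list(range(len(t) + 1))        # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        cur = [i]                         # deleting the first i characters of s
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution or match
        prev = cur
    return prev[-1]
```

The two nested loops give the $O(mn)$ time bound, and no assumption on the alphabet is needed since characters are only compared for equality.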

In Section 5.5, we show how to achieve $O(nm\log k/k^2 + m + n)$ time and $O(2^k + \min(n, m))$ space for
any $k \le w$ for an arbitrary alphabet. Hence, we remove the dependency on the alphabet at the cost of a
$\log k$ factor in the running time.

5.2 Regular Expression Matching 

Given a string $Q$ and a regular expression $R$ the Regular Expression Matching problem is to determine
if $Q$ is in the language given by $R$. Let $n$ and $m$ be the sizes of $Q$ and $R$, respectively. In this section we
show that Regular Expression Matching can be solved in $O(mn/k + n + m\log m)$ time and $O(2^k + m)$
space, for $k \le w$.

5.2.1 Regular Expressions and NFAs 

We briefly review Thompson's construction and the standard node set simulation. The set of regular
expressions over $\Sigma$ is defined recursively as follows:

• A character $\alpha \in \Sigma$ is a regular expression.

• If $S$ and $T$ are regular expressions then so is the catenation, $(S) \cdot (T)$, the union, $(S)|(T)$, and the star, $(S)^*$.

Unnecessary parentheses can be removed by observing that $\cdot$ and $|$ are associative and by using the standard
precedence of the operators, that is $*$ precedes $\cdot$, which in turn precedes $|$. Furthermore, we will often remove
the $\cdot$ when writing regular expressions. The language $L(R)$ generated by $R$ is the set of all strings matching
$R$. The parse tree $T(R)$ of $R$ is the rooted and ordered tree representing the hierarchical structure of $R$.
All leaves are represented by a character in $\Sigma$ and all internal nodes are labeled $\cdot$, $|$, or $*$. We assume that
parse trees are binary and constructed such that they are in one-to-one correspondence with the regular
expressions. An example parse tree of the regular expression $ac|a^*b$ is shown in Fig. 5.2(a).

Figure 5.1: Thompson's NFA construction. The regular expression for a character $\alpha \in \Sigma$ corresponds to NFA
(a). If $S$ and $T$ are regular expressions then $N(ST)$, $N(S|T)$, and $N(S^*)$ correspond to NFAs (b), (c), and
(d), respectively. Accepting nodes are marked with a double circle.

A finite automaton $A$ is a tuple $A = (G, \Sigma, \theta, \Phi)$ such that,

• $G$ is a directed graph,

• Each edge $e \in E(G)$ is labeled with a character $\alpha \in \Sigma$ or $\epsilon$,

• $\theta \in V(G)$ is a start node,

• $\Phi \subseteq V(G)$ is the set of accepting nodes.

$A$ is a deterministic finite automaton (DFA) if $A$ does not contain any $\epsilon$-edges, and for each node $v \in V(G)$
all outgoing edges have different labels. Otherwise, $A$ is a non-deterministic automaton (NFA). We say
that $A$ accepts a string $Q$ if there is a path from $\theta$ to a node in $\Phi$ which spells out $Q$.

Using Thompson's method [Tho68] we can recursively construct an NFA $N(R)$ accepting all strings in
$L(R)$. The set of rules is presented below and illustrated in Fig. 5.1.

• $N(\alpha)$ is the automaton consisting of a start node $\theta_\alpha$, accepting node $\phi_\alpha$, and an $\alpha$-edge from $\theta_\alpha$ to $\phi_\alpha$.

• Let $N(S)$ and $N(T)$ be automata for regular expressions $S$ and $T$ with start and accepting nodes $\theta_S$,
$\theta_T$, $\phi_S$, and $\phi_T$, respectively. Then, NFAs for $N(S \cdot T)$, $N(S|T)$, and $N(S^*)$ are constructed as follows:

$N(ST)$: Merge the nodes $\phi_S$ and $\theta_T$ into a single node. The new start node is $\theta_S$ and the new
accepting node is $\phi_T$.

$N(S|T)$: Add a new start node $\theta_{S|T}$ and new accepting node $\phi_{S|T}$. Then, add $\epsilon$-edges from $\theta_{S|T}$ to $\theta_S$
and $\theta_T$, and from $\phi_S$ and $\phi_T$ to $\phi_{S|T}$.

$N(S^*)$: Add a new start node $\theta_{S^*}$ and new accepting node $\phi_{S^*}$. Then, add $\epsilon$-edges from $\theta_{S^*}$ to $\theta_S$
and $\phi_{S^*}$, and from $\phi_S$ to $\phi_{S^*}$ and $\theta_S$.

By construction, $N(R)$ has a single start and accepting node, denoted $\theta$ and $\phi$, respectively. $\theta$ has no
incoming edges and $\phi$ has no outgoing edges. The total number of nodes is at most $2m$ and since each
node has at most 2 outgoing edges the total number of edges is less than $4m$. Furthermore, all incoming
edges of a node have the same label, and we denote a node with incoming $\alpha$-edges an $\alpha$-node. Note that the star
construction in Fig. 5.1(d) introduces an edge from the accepting node of $N(S)$ to the start node of $N(S)$.
All such edges in $N(R)$ are called back edges and all other edges are forward edges. We need the following
important property of $N(R)$.

Lemma 32 (Myers [Mye92a]) Any cycle-free path in $N(R)$ contains at most one back edge.

For a string $Q$ of length $n$ the standard node-set simulation of $N(R)$ on $Q$ produces a sequence of node-sets
$S_0, \ldots, S_n$. A node $v$ is in $S_i$ iff there is a path from $\theta$ to $v$ that spells out the $i$th prefix of $Q$. The simulation
can be implemented with the following simple operations. Let $S$ be a node-set in $N(R)$ and let $\alpha$ be a
character in $\Sigma$.

Move($S$, $\alpha$): Compute and return the set of nodes reachable from $S$ via a single $\alpha$-edge.
Close($S$): Compute and return the set of nodes reachable from $S$ via 0 or more $\epsilon$-edges.

The number of nodes and edges in $N(R)$ is $O(m)$, and both operations are implementable in $O(m)$ time.
The simulation proceeds as follows: Initially, $S_0$ := Close($\{\theta\}$). If $Q[j] = \alpha$, $1 \le j \le n$, then $S_j$ :=
Close(Move($S_{j-1}$, $\alpha$)). Finally, $Q \in L(R)$ iff $\phi \in S_n$. Since each node-set $S_j$ only depends on $S_{j-1}$ this
algorithm uses $O(mn)$ time and $O(m)$ space.
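
The construction rules and the node-set simulation described above can be put together in a short Python sketch. The parse tree is assumed to be given as nested tuples `('char', c)`, `('cat', S, T)`, `('alt', S, T)` and `('star', S)`; this input format and all names in the code are choices made for the example, not part of the text. Catenation is realized by building $N(T)$ directly on the accepting node of $N(S)$, which has the same effect as merging $\phi_S$ and $\theta_T$.

```python
class NFA:
    def __init__(self):
        self.eps = []   # eps[v] = targets of the epsilon-edges leaving v
        self.sym = []   # sym[v] = (character, target) pairs of labeled edges leaving v

    def new_node(self):
        self.eps.append([])
        self.sym.append([])
        return len(self.eps) - 1


def build(nfa, tree, start):
    """Build N(tree) with the given start node and return its accepting node."""
    kind = tree[0]
    if kind == 'char':
        acc = nfa.new_node()
        nfa.sym[start].append((tree[1], acc))
        return acc
    if kind == 'cat':                      # accepting node of S doubles as start of T
        return build(nfa, tree[2], build(nfa, tree[1], start))
    if kind == 'alt':
        accs = []
        for sub in tree[1:]:
            s = nfa.new_node()
            nfa.eps[start].append(s)
            accs.append(build(nfa, sub, s))
        acc = nfa.new_node()
        for a in accs:
            nfa.eps[a].append(acc)
        return acc
    if kind == 'star':
        s = nfa.new_node()
        nfa.eps[start].append(s)
        a = build(nfa, tree[1], s)
        acc = nfa.new_node()
        nfa.eps[start].append(acc)         # skip the starred expression entirely
        nfa.eps[a] += [acc, s]             # leave, or take the back edge
        return acc
    raise ValueError(kind)


def close(nfa, nodes):
    """Nodes reachable from `nodes` via 0 or more epsilon-edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for w in nfa.eps[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def move(nfa, nodes, a):
    """Nodes reachable from `nodes` via a single edge labeled a."""
    return {w for v in nodes for (c, w) in nfa.sym[v] if c == a}


def matches(tree, q):
    nfa = NFA()
    theta = nfa.new_node()
    phi = build(nfa, tree, theta)
    S = close(nfa, {theta})                # S_0
    for a in q:
        S = close(nfa, move(nfa, S, a))    # S_j from S_{j-1}
    return phi in S
```

Only the current node-set is kept between steps, which reflects the $O(m)$ space bound of the simulation.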

5.2.2 Outline of Algorithm 

The algorithm presented in the following section resembles the one by Myers [Mye92a]. The key to improving
the space is the use of compact data structures and an efficient encoding of small automata. We first present
a clustering of $T(R)$ in Section 5.2.3. This leads to a decomposition of $N(R)$ into small subautomata. In
Section 5.2.4 we define appropriate Move and Close operations on the subautomata. With these we show
how to simulate the node-set algorithm on $N(R)$. Finally, in Section 5.2.5 we give a compact representation
for the Move and Close operations on subautomata of size $\Theta(k)$. The representation allows constant time
simulation of each subautomaton, leading to the speedup.

5.2.3 Decomposing the NFA 

In this section we show how to decompose $N(R)$ into small subautomata. In the final algorithm transitions
through these subautomata will be simulated in constant time. The decomposition is based on a clustering
of the parse tree $T(R)$. Our decomposition is similar to the one given in [Mye92a, WMM95]. A cluster $C$ is
a connected subgraph of $T(R)$. A cluster partition $CS$ is a partition of the nodes of $T(R)$ into node-disjoint
clusters. Since $T(R)$ is a binary tree, a bottom-up procedure yields the following lemma.

Lemma 33 For any regular expression $R$ of size $m$ and a parameter $x$, it is possible to build a cluster
partition $CS$ of $T(R)$, such that $|CS| = O(m/x)$ and for any $C \in CS$ the number of nodes in $C$ is at most
$x$.

An example clustering of a parse tree is shown in Fig. 5.2(b). 
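
One possible bottom-up procedure of the kind alluded to before Lemma 33 is sketched below in Python. It is an assumption of this example, not necessarily the procedure the authors have in mind: nodes are processed children first, each node joins the still-open clusters of its children as long as the combined size stays within $x$, and otherwise the children's clusters are closed. Since a binary node closes at most two clusters holding at least $x$ nodes together, this greedy rule yields at most $2m/x + 1$ clusters.

```python
def cluster_partition(children, root, x):
    """Partition a binary tree into connected clusters of at most x nodes.

    children: dict node -> list of at most two children (assumed input format).
    Returns a dict mapping every node to the top node of its cluster.
    """
    parent = {root: None}
    order, stack = [], [root]
    while stack:                           # parents appear before their children
        v = stack.pop()
        order.append(v)
        for c in children.get(v, []):
            parent[c] = v
            stack.append(c)

    open_size = {}                         # size of the open cluster topped by v
    tops = {root}                          # top nodes of finished clusters
    for v in reversed(order):              # children before parents
        kids = children.get(v, [])
        total = 1 + sum(open_size[c] for c in kids)
        if total > x:                      # v cannot absorb its children: close them
            tops.update(kids)
            total = 1
        open_size[v] = total

    cluster_of = {}
    for v in order:                        # nearest closed ancestor (or v itself)
        cluster_of[v] = v if v in tops else cluster_of[parent[v]]
    return cluster_of
```

Every cluster consists of a top node together with descendants that were merged into it, so clusters are connected subgraphs of the tree, as the definition of a cluster requires.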

Before proceeding, we need some definitions. Assume that $CS$ is a cluster partition of $T(R)$ for some
yet-to-be-determined parameter $x$. Edges adjacent to two clusters are external edges and all other edges are
internal edges. Contracting all internal edges induces a macro tree, where each cluster is represented by a
single macro node. Let $C_v$ and $C_w$ be two clusters with corresponding macro nodes $v$ and $w$. We say that
$C_v$ is a parent cluster (resp. child cluster) of $C_w$ if $v$ is the parent (resp. child) of $w$ in the macro tree. The
root cluster and leaf clusters are the clusters corresponding to the root and the leaves of the macro tree.

Next we show how to decompose $N(R)$ into small subautomata. Each cluster $C$ will correspond to a
subautomaton $A$ and we use the terms child, parent, root, and leaf for subautomata in the same way we do
with clusters. For a cluster $C$, we insert a special pseudo-node $p_i$ for each child cluster $C_1, \ldots, C_l$ in the middle
of the external edge connecting $C$ and $C_i$. Now, $C$'s subautomaton $A$ is the automaton corresponding to
the parse tree induced by the set of nodes $V(C) \cup \{p_1, \ldots, p_l\}$. The pseudo-nodes are alphabet placeholders,
since the leaves of a well-formed parse tree must be characters.

In $A$, child automaton $A_i$ is represented by its start and accepting node $\theta_{A_i}$ and $\phi_{A_i}$ and a pseudo-edge
connecting them. An example of these definitions is given in Fig. 5.2. Any cluster $C$ of size at most $x$ has
less than $2x$ pseudo-children and therefore the size of the corresponding subautomaton is at most $6x$. Note,
therefore, that automata derived from regular expressions can thus be decomposed into $O(m/z)$ subautomata
each of size at most $z$, by Lemma 33 and the above construction.
Figure 5.2: (a) The parse tree for the regular expression $ac|a^*b$. (b) A clustering of (a) into node-disjoint
connected subtrees $C_1$, $C_2$, and $C_3$. Here, $x = 3$. (c) The clustering from (b) extended with pseudo-nodes.
(d) The automaton for the parse tree divided into subautomata corresponding to the clustering. (e) The
subautomaton $A_1$ with pseudo-edges corresponding to the child automata.

5.2.4 Simulating the NFA 

In this section we show how to do a node-set simulation of $N(R)$ using the subautomata. Recall that each
subautomaton has size less than $z$. Topologically sort all nodes in each subautomaton $A$ ignoring back
edges. This can be done for all subautomata in total $O(m)$ time. We represent the current node-set $S$ of
$N(R)$ compactly using a bitvector for each subautomaton. Specifically, for each subautomaton $A$ we store a
characteristic bitvector $B = [b_1, \ldots, b_z]$, where nodes in $B$ are indexed by their topological order, such
that $B[i] = 1$ iff the $i$th node is in $S$. If $A$ contains fewer than $z$ nodes we leave the remaining values
undefined. For simplicity, we will refer to the state of $A$ as the node-set represented by the characteristic
vector stored at $A$. Similarly, the state of $N(R)$ is the set of characteristic vectors representing $S$. The state
of a node is the bit indicating if the node is in $S$. Since any child $A'$ of $A$ overlaps with $A$ at the nodes $\theta_{A'}$ and $\phi_{A'}$
we will ensure that the state of $\theta_{A'}$ and $\phi_{A'}$ is the same in the characteristic vectors of both $A$ and $A'$.

Below we present appropriate move and $\epsilon$-closure operations defined on subautomata. Due to the overlap
between parent and child nodes these operations take a bit $b$ which we will use to propagate the new state of
the start node. For each subautomaton $A$, characteristic vector $B$, bit $b$, and character $\alpha \in \Sigma$ define:

Move$^A$($B$, $b$, $\alpha$): Compute the state $B'$ of all nodes in $A$ reachable via a single $\alpha$-edge from $B$. If $b = 0$,
return $B'$, else return $B' \cup \{\theta_A\}$.

Close$^A$($B$, $b$): Return the set $B'$ of all nodes in $A$ reachable via a path of 0 or more $\epsilon$-edges from $B$, if
$b = 0$, or reachable from $B \cup \{\theta_A\}$, if $b = 1$.

We will later show how to implement these operations in constant time and total $2^{O(k)}$ space when $z = \Theta(k)$.
Before doing so we show how to use these operations to perform the node-set simulation of $N(R)$. Assume
that the current node-set of $N(R)$ is represented by its characteristic vector for each subautomaton. The
following Move and Close operations recursively traverse the hierarchy of subautomata top-down. At each
subautomaton the current state of $N(R)$ is modified using primarily Move$^A$ and Close$^A$. For any subautomaton
$A$, bit $b$, and character $\alpha \in \Sigma$ define:

Move($A$, $b$, $\alpha$): Let $B$ be the current state of $A$ and let $A_1, \ldots, A_l$ be the children of $A$ in topological order of
their start node.

1. Compute $B'$ := Move$^A$($B$, $b$, $\alpha$).

2. For each $A_i$, $1 \le i \le l$,

   (a) Compute $f_i$ := Move($A_i$, $b_i$, $\alpha$), where $b_i = 1$ iff $\theta_{A_i} \in B'$.

   (b) If $f_i = 1$ set $B' := B' \cup \{\phi_{A_i}\}$.

3. Store $B'$ and return the value 1 if $\phi_A \in B'$ and 0 otherwise.

Close($A$, $b$): Let $B$ be the current state of $A$ and let $A_1, \ldots, A_l$ be the children of $A$ in topological order of
their start node.

1. Compute $B'$ := Close$^A$($B$, $b$).

2. For each child automaton $A_i$, $1 \le i \le l$,

   (a) Compute $f_i$ := Close($A_i$, $b_i$), where $b_i = 1$ iff $\theta_{A_i} \in B'$.

   (b) If $f_i = 1$ set $B' := B' \cup \{\phi_{A_i}\}$.

   (c) $B'$ := Close$^A$($B'$, $b$).

3. Store $B'$ and return the value 1 if $\phi_A \in B'$ and 0 otherwise.

The "store" in line 3 of both operations updates the state of the subautomaton. The node-set simulation
of $N(R)$ on string $Q$ of length $n$ produces the states $S_0, \ldots, S_n$ as follows. Let $A_r$ be the root automaton.
Initialize the state of $N(R)$ to be empty, i.e., set all bitvectors to 0. $S_0$ is computed by calling Close($A_r$, 1)
twice. Assume that $S_{j-1}$, $1 \le j \le n$, is the current state of $N(R)$ and let $\alpha = Q[j]$. Compute $S_j$ by calling
Move($A_r$, 0, $\alpha$) and then calling Close($A_r$, 0) twice. Finally, $Q \in L(R)$ iff $\phi \in S_n$.

We argue that the above algorithm is correct. To do this we need to show that the call to the Move
operation and the two calls to the Close operation simulate the standard Move and Close operations.

First consider the Move operation. Let $S$ be the state of $N(R)$ and let $S'$ be the state after a call to
Move($A_r$, 0, $\alpha$). Consider any subautomaton $A$ and let $B$ and $B'$ be the bitvectors of $A$ corresponding to
states $S$ and $S'$, respectively. We first show by induction that after Move($A$, 0, $\alpha$) the new state $B'$ is the set
of nodes reachable from $B$ via a single $\alpha$-edge in $N(R)$. For Move($A$, 1, $\alpha$) a similar argument shows that
the new state is the union of the set of nodes reachable from $B$ via a single $\alpha$-edge and $\{\theta_A\}$.

Initially, we compute $B'$ := Move$^A$($B$, 0, $\alpha$). Thus $B'$ contains the set of nodes reachable via a single
$\alpha$-edge in $A$. If $A$ is a leaf automaton then $B'$ satisfies the property and the algorithm returns. Otherwise,
there may be an $\alpha$-edge to some accepting node $\phi_{A_i}$ of a child automaton $A_i$. Since this edge is not contained
in $A$, $\phi_{A_i}$ is not initially in $B'$. However, since each child is handled recursively in topological order and the new
states of start and accepting nodes are propagated, it follows that $\phi_{A_i}$ is ultimately added to $B'$. Note that
since a single node can be the accepting node of a child $A_i$ and the start node of child $A_{i+1}$, the topological
order is needed to ensure a consistent update of the state.

It now follows that the state $S'$ of $N(R)$ after Move($A_r$, 0, $\alpha$) consists of all nodes reachable via a single
$\alpha$-edge from $S$. Hence, Move($A_r$, 0, $\alpha$) correctly simulates a standard Move operation.

Next consider the two calls to the Close operation. Let $S$ be the state of $N(R)$ and let $S'$ be the state after
the first call to $\mathrm{Close}(A_r, 0)$. As above, consider any subautomaton $A$ and let $B$ and $B'$ be the bitvectors of $A$
corresponding to $S$ and $S'$, respectively. We show by induction that after $\mathrm{Close}(A, 0)$ the state $B'$ contains the
set of nodes in $N(R)$ reachable via a path of 0 or more forward $\epsilon$-edges from $B$. Initially, $B' := \mathrm{Close}^A(B, 0)$,
and hence $B'$ contains all nodes reachable via a path of 0 or more $\epsilon$-edges from $B$, where the path consists
solely of edges in $A$. If $A$ is a leaf automaton, the result immediately holds. Otherwise, there may be a path
of $\epsilon$-edges to a node $v$ going through the children of $A$. As above, the recursive topological processing of the
children ensures that $v$ is added to $B'$.

Hence, after the first call to $\mathrm{Close}(A_r, 0)$ the state $S'$ contains all nodes reachable from $S$ via a path of
0 or more forward $\epsilon$-edges. By a similar argument it follows that the second call to $\mathrm{Close}(A_r, 0)$ produces the
state $S''$ that contains all the nodes reachable from $S$ via a path of 0 or more forward $\epsilon$-edges and 1 back
edge. However, by Lemma 32 this is exactly the set of nodes reachable via a path of 0 or more $\epsilon$-edges.
Furthermore, since $\mathrm{Close}(A_r, 0)$ never produces a state with nodes that are not reachable through $\epsilon$-edges,
it follows that the two calls to $\mathrm{Close}(A_r, 0)$ correctly simulate a standard Close operation.

Finally, note that if we start with a state with no nodes, we can compute the state $S_0$ in the node-set
simulation by calling $\mathrm{Close}(A_r, 1)$ twice. Hence, the above algorithm correctly solves Regular Expression
Matching.

If the subautomata have size at most $z$ and $\mathrm{Move}^A$ and $\mathrm{Close}^A$ can be computed in constant time, the
above algorithm computes a step in the node-set simulation in $O(m/z)$ time. In the following section we
show how to do this in $O(2^k)$ space for $z = \Theta(k)$. Note that computing the clustering uses an additional
$O(m)$ time and space.



5.2.5 Representing Subautomata 

To efficiently represent $\mathrm{Move}^A$ and $\mathrm{Close}^A$ we apply a Four Russians trick. Consider a straightforward code
for $\mathrm{Move}^A$: Precompute the value of $\mathrm{Move}^A$ for all $B$, both values of $b$, and all characters $a$. Since the
number of different bitvectors is $2^z$ and the size of the alphabet is $\sigma$, this table has $2^{z+1}\sigma$ entries. Each
entry can be stored in a single word, so the table also uses a total of $2^{z+1}\sigma$ space. The total number of
subautomata is $O(m/z)$, and therefore the total size of these tables is an unacceptable $O(\frac{m}{z} \cdot 2^z \sigma)$.

To improve this we use a more elaborate approach. First we factor out the dependency on the alphabet,
as follows. For all subautomata $A$ and all characters $a \in \Sigma$ define:

$\mathrm{Succ}^A(B)$: Return the set of all nodes in $A$ reachable from $B$ by a single edge.

$\mathrm{Eq}^A(a)$: Return the set of all $a$-nodes in $A$.

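Since all incoming edges to a node are labeled with the same character it follows that,

\[
\mathrm{Move}^A(B, b, a) =
\begin{cases}
\mathrm{Succ}^A(B) \cap \mathrm{Eq}^A(a) & \text{if } b = 0, \\
(\mathrm{Succ}^A(B) \cap \mathrm{Eq}^A(a)) \cup \{\theta_A\} & \text{if } b = 1.
\end{cases}
\]
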

Hence, given Succ and Eq we can implement Move in constant time using bit operations. To efficiently
represent $\mathrm{Eq}^A$, for each subautomaton $A$, store the value of $\mathrm{Eq}^A(a)$ in a hash table. Since the total number of
different characters in $A$ is at most $z$, the hash table for $\mathrm{Eq}^A$ contains at most $z$ entries. Hence, we can represent
$\mathrm{Eq}^A$ for all subautomata in $O(m)$ space with constant worst-case lookup time. The preprocessing time is
$O(m)$ w.h.p. To get a worst-case preprocessing bound we use the deterministic dictionary of [HMP01] with
$O(m \log m)$ worst-case preprocessing time.
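
As an illustration of this decomposition, the following Python sketch tabulates Succ over all bitvectors of a tiny subautomaton, stores Eq in a dictionary, and evaluates Move with two bit operations. The node numbering, the helper names (`make_succ_table`, `make_eq`, `move`), and the use of a Python `dict` in place of the hash table are assumptions made for the example; it is a minimal sketch, not the construction used in the analysis.

```python
def make_succ_table(num_nodes, edges):
    """Tabulate Succ(B) for every bitvector B over num_nodes nodes.

    edges is a list of (u, v, label) triples; labels are ignored here,
    since Succ follows every edge regardless of its label.
    """
    out = [0] * num_nodes                 # out[u] = bitmask of direct successors of u
    for u, v, _ in edges:
        out[u] |= 1 << v
    table = [0] * (1 << num_nodes)
    for b in range(1, 1 << num_nodes):
        low = b & -b                      # lowest set bit of b
        # Succ(B) = Succ(B without its lowest node) union successors of that node
        table[b] = table[b ^ low] | out[low.bit_length() - 1]
    return table


def make_eq(edges):
    """Eq[a] = bitmask of a-nodes, i.e. nodes whose incoming edges are labeled a."""
    eq = {}
    for _, v, label in edges:
        if label is not None:             # None marks an epsilon-edge
            eq[label] = eq.get(label, 0) | (1 << v)
    return eq


def move(succ, eq, theta, B, b, a):
    """Move(B, b, a) = Succ(B) & Eq(a), plus the start node theta when b == 1."""
    result = succ[B] & eq.get(a, 0)
    if b == 1:
        result |= 1 << theta
    return result


# Hypothetical subautomaton for the expression "ab" (node ids are arbitrary):
# 0 -eps-> 1 -a-> 2 -eps-> 3 -b-> 4 -eps-> 5, with start node theta = 0.
EDGES = [(0, 1, None), (1, 2, "a"), (2, 3, None), (3, 4, "b"), (4, 5, None)]
SUCC = make_succ_table(6, EDGES)
EQ = make_eq(EDGES)
```

Note that `SUCC` depends only on the shape of the automaton, while all label information lives in `EQ`; this separation is what allows the Succ tables to be shared between subautomata with the same unlabeled structure.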

We note that the idea of using $\mathrm{Eq}^A(a)$ to represent the $a$-nodes is not new and has been used in several
string matching algorithms, for instance, in the classical Shift-Or algorithm [BYG92] and in the recent
optimized DFA construction for regular expression matching [NR04].

To represent Succ compactly we proceed as follows. Let $\bar{A}$ be the automaton obtained by removing the
labels from edges in $A$. $\mathrm{Succ}^{A_1}$ and $\mathrm{Succ}^{A_2}$ compute the same function if $\bar{A}_1 = \bar{A}_2$. Hence, to represent Succ
it suffices to precompute Succ on all possible subautomata $\bar{A}$. By the one-to-one correspondence of parse
trees and automata we have that each subautomaton $\bar{A}$ corresponds to a parse tree with leaf labels removed.
Each such parse tree has at most $x$ internal nodes and $2x$ leaves. The number of rooted, ordered, binary
trees with at most $3x$ nodes is less than $2^{6x+1}$, and for each such tree each internal node can have one of 3
different labels. Hence, the total number of distinct subautomata is less than $2^{6x+1} 3^x$. Each subautomaton
has at most $6x$ nodes and therefore the result of $\mathrm{Succ}^A$ has to be computed for each of the $2^{6x}$ different
values for $B$ using $O(x 2^{6x})$ time. Therefore we can precompute all values of Succ in $O(x 2^{12x+1} 3^x)$ time.
Choosing $x$ such that $x + \frac{\log x}{12 + \log 3} \leq \frac{k-1}{12 + \log 3}$ gives us $O(2^k)$ space and preprocessing time.
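
To get a feel for the constants hidden in this choice, the bound can be evaluated numerically. The sketch below (the function name `largest_x` is ours and purely illustrative) finds the largest $x$ with $x 2^{12x+1} 3^x \leq 2^k$, which is the quantity bounded in the precomputation argument above.

```python
def largest_x(k):
    """Largest x >= 0 such that x * 2**(12x+1) * 3**x <= 2**k."""
    x = 0
    # Try x + 1 and keep going while the precomputation bound still fits in 2**k.
    while (x + 1) * 2 ** (12 * (x + 1) + 1) * 3 ** (x + 1) <= 2 ** k:
        x += 1
    return x
```

For a 64-bit budget this gives a subautomaton parameter of only a few nodes, which is consistent with $x = \Theta(k)$ but shows that the constant factor is small.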

Using an analogous argument, it follows that $\mathrm{Close}^A$ can be precomputed for all distinct subautomata
within the same complexity. By our discussion in the previous sections and since $x = \Theta(k)$ we have shown
the following theorem:

Theorem 12 For regular expression $R$ of length $m$, string $Q$ of length $n$, and $k \leq w$, Regular Expression
Matching can be solved in $O(mn/k + n + m \log m)$ time and $O(2^k + m)$ space.



5.3 Approximate Regular Expression Matching 

Given a string $Q$, a regular expression $R$, and an integer $d \geq 0$, the Approximate Regular Expression
Matching problem is to determine if $Q$ is within edit distance $d$ to a string in $L(R)$. In this section
we extend our solution for Regular Expression Matching to Approximate Regular Expression
Matching. Specifically, we show that the problem can be solved in $O(\frac{mn \log(d+2)}{k} + n + m \log m)$ time and
$O(2^k + m)$ space, for any $k \leq w$.



5.3.1 Dynamic Programming Recurrence 

Our algorithm is based on a dynamic programming recurrence due to Myers and Miller [MM89], which we
describe below. Let $\Delta(v, i)$ denote the minimum over all paths $\mathcal{P}$ between $\theta$ and $v$ of the edit distance
between $\mathcal{P}$ and the $i$th prefix of $Q$. The recurrence avoids cyclic dependencies from the back edges by
splitting the recurrence into two passes. Intuitively, the first pass handles forward edges and the second
pass propagates values from back edges. The pass-1 value of $v$ is denoted $\Delta_1(v, i)$, and the pass-2 value is
$\Delta_2(v, i)$. For a given $i$, the pass-1 (resp. pass-2) value of $N(R)$ is the set of pass-1 (resp. pass-2) values of
all nodes of $N(R)$. For all $v$ and $i$, we set $\Delta(v, i) = \Delta_2(v, i)$.

The set of predecessors of $v$ is the set of nodes $\mathrm{Pre}(v) = \{w \mid (w, v) \text{ is an edge}\}$. We define $\overline{\mathrm{Pre}}(v) =
\{w \mid (w, v) \text{ is a forward edge}\}$. For notational convenience, we extend the definitions of $\Delta_1$ and $\Delta_2$ to
apply to sets, as follows: $\Delta_1(\mathrm{Pre}(v), i) = \min_{w \in \mathrm{Pre}(v)} \Delta_1(w, i)$ and $\Delta_1(\overline{\mathrm{Pre}}(v), i) = \min_{w \in \overline{\mathrm{Pre}}(v)} \Delta_1(w, i)$, and
analogously for $\Delta_2$. The pass-1 and pass-2 values satisfy the following recurrence:



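\[
\Delta_2(\theta, i) = \Delta_1(\theta, i) = i, \qquad 0 \leq i \leq n,
\]
\[
\Delta_2(v, 0) = \Delta_1(v, 0) =
\begin{cases}
\Delta_2(\overline{\mathrm{Pre}}(v), 0) + 1 & \text{if $v$ is a $\Sigma$-node,} \\
\Delta_2(\overline{\mathrm{Pre}}(v), 0) & \text{if $v \neq \theta$ is an $\epsilon$-node.}
\end{cases}
\]

For $1 \leq i \leq n$,

\[
\Delta_1(v, i) =
\begin{cases}
\min(\Delta_2(v, i-1) + 1,\ \Delta_2(\mathrm{Pre}(v), i-1) + \lambda(v, Q[i]),\ \Delta_1(\overline{\mathrm{Pre}}(v), i) + 1) & \text{if $v$ is a $\Sigma$-node,} \\
\Delta_1(\overline{\mathrm{Pre}}(v), i) & \text{if $v \neq \theta$ is an $\epsilon$-node,}
\end{cases}
\]

where $\lambda(v, Q[i]) = 0$ if $v$ is a $Q[i]$-node and $1$ otherwise, and

\[
\Delta_2(v, i) =
\begin{cases}
\min(\Delta_1(v, i),\ \Delta_2(\overline{\mathrm{Pre}}(v), i) + 1) & \text{if $v$ is a $\Sigma$-node,} \\
\min(\Delta_1(\mathrm{Pre}(v), i),\ \Delta_2(\overline{\mathrm{Pre}}(v), i)) & \text{if $v$ is an $\epsilon$-node.}
\end{cases}
\]
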

A full proof of the correctness of the above recurrence can be found in [MM89, WMM95]. Intuitively, the first
pass handles forward edges as follows: For $\Sigma$-nodes the recurrence handles insertions, substitution/matches,
and deletions (in this order). For $\epsilon$-nodes the values computed so far are propagated. Subsequently, the
second pass handles the back edges. For our problem we want to determine if $Q$ is within edit distance $d$.
Hence, we can replace all values exceeding $d$ by $d + 1$.

5.3.2 Simulating the Recurrence



Our algorithm now proceeds analogously to the case with $d = 0$ above. We will decompose the automaton
into subautomata, and we will compute the above dynamic program on an appropriate encoding of the
subautomata, leading to a small-space speedup.

As before, we decompose $N(R)$ into subautomata of size less than $z$. For a subautomaton $A$ we define
operations $\mathrm{Next}^1_A$ and $\mathrm{Next}^2_A$ which we use to compute the pass-1 and pass-2 values of $A$, respectively.
However, the new (pass-1 or pass-2) value of $A$ depends on pseudo-edges in a more complicated way than
before: If $A'$ is a child of $A$, then all nodes following $\phi_{A'}$ depend on the value of $\phi_{A'}$. Hence, we need the
value of $\phi_{A'}$ before we can compute values of the nodes following $\phi_{A'}$. To address this problem we partition
the nodes of a subautomaton as described below.

For each subautomaton $A$, topologically sort the nodes (ignoring back edges) with the requirement that
for each child $A'$ the start and accepting nodes $\theta_{A'}$ and $\phi_{A'}$ are consecutive in the order. Contracting all
pseudo-edges in $A$, this can be done for all subautomata in $O(m)$ time. Let $A_1, \ldots, A_l$ be the children of $A$
in this order. We partition the nodes in $A$, except $\{\theta_A\} \cup \{\phi_{A_1}, \ldots, \phi_{A_l}\}$, into $l + 1$ chunks. The first chunk
is the nodes in the interval $[\theta_A + 1, \theta_{A_1}]$. If we let $\theta_{A_{l+1}} = \phi_A$ then the $i$th chunk, $1 < i \leq l + 1$, is the set
of nodes in the interval $[\phi_{A_{i-1}} + 1, \theta_{A_i}]$. A leaf automaton has a single chunk consisting of all nodes except
the start node. We represent the $i$th chunk in $A$ by a characteristic vector $L_i$ identifying the nodes in the
chunk, that is, $L_i[j] = 1$ if node $j$ is in the $i$th chunk and $0$ otherwise. From the topological order we can
compute all chunks and their corresponding characteristic vectors in total $O(m)$ time.

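The value of $A$ is represented by a vector $B = [b_1, \ldots, b_z]$, such that $b_i \in [0, d+1]$. Hence, the total
number of bits used to encode $B$ is $z \lceil \log(d + 2) \rceil$. For an automaton $A$, a value vector $B$, a characteristic
vector $L$, and a character $a \in \Sigma$ define the operations $\mathrm{Next}^1_A(B, L, a)$ and $\mathrm{Next}^2_A(B, L)$ as the vectors $B_1$ and $B_2$,
respectively, given by:

\[
B_1[v] =
\begin{cases}
B[v] & \text{if } v \notin L, \\
\min(B[v] + 1,\ B[\mathrm{Pre}(v)] + \lambda(v, a),\ B_1[\overline{\mathrm{Pre}}(v)] + 1) & \text{if $v \in L$ is a $\Sigma$-node,} \\
B_1[\overline{\mathrm{Pre}}(v)] & \text{if $v \in L$ is an $\epsilon$-node,}
\end{cases}
\]
\[
B_2[v] =
\begin{cases}
B[v] & \text{if } v \notin L, \\
\min(B[v],\ B_2[\overline{\mathrm{Pre}}(v)] + 1) & \text{if $v \in L$ is a $\Sigma$-node,} \\
\min(B[\mathrm{Pre}(v)],\ B_2[\overline{\mathrm{Pre}}(v)]) & \text{if $v \in L$ is an $\epsilon$-node.}
\end{cases}
\]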



Importantly, note that the operations only affect the nodes in the chunk specified by $L$. We will use this
below to compute new values of $A$ by advancing one chunk at each step. We use the following recursive
operations: For subautomaton $A$, integer $b$, and character $a$ define:

$\mathrm{Next}_1(A, b, a)$: Let $B$ be the current value of $A$ and let $A_1, \ldots, A_l$ be the children of $A$ in topological order of
their start node.

1. Set $B_1 := B$ and $B_1[\theta_A] := b$.

2. For each chunk $L_i$, $1 \leq i \leq l$,

(a) Compute $B_1 := \mathrm{Next}^1_A(B_1, L_i, a)$.

(b) Compute $f_i := \mathrm{Next}_1(A_i, B_1[\theta_{A_i}], a)$.

(c) Set $B_1[\phi_{A_i}] := f_i$.

3. Compute $B_1 := \mathrm{Next}^1_A(B_1, L_{l+1}, a)$.

4. Return $B_1[\phi_A]$.

$\mathrm{Next}_2(A, b)$: Let $B$ be the current value of $A$ and let $A_1, \ldots, A_l$ be the children of $A$ in topological order of
their start node.

1. Set $B_2 := B$ and $B_2[\theta_A] := b$.

2. For each chunk $L_i$, $1 \leq i \leq l$,

(a) Compute $B_2 := \mathrm{Next}^2_A(B_2, L_i)$.

(b) Compute $f_i := \mathrm{Next}_2(A_i, B_2[\theta_{A_i}])$.

(c) Set $B_2[\phi_{A_i}] := f_i$.

3. Compute $B_2 := \mathrm{Next}^2_A(B_2, L_{l+1})$.

4. Return $B_2[\phi_A]$.

The simulation of the dynamic programming recurrence on a string $Q$ of length $n$ proceeds as follows: First
encode the initial values of all the nodes in $N(R)$ using the recurrence. Let $A_r$ be the root automaton, let
$S_{j-1}$ be the current value of $N(R)$, and let $a = Q[j]$. Compute the next value $S_j$ by calling $\mathrm{Next}_1(A_r, j, a)$
and then $\mathrm{Next}_2(A_r, j)$. Finally, if the value of $\phi$ in the pass-2 value of $S_n$ is at most $d$, report a match.

To see the correctness, we need to show that the calls to the $\mathrm{Next}_1$ and $\mathrm{Next}_2$ operations correctly compute the
pass-1 and pass-2 values of $N(R)$. First consider $\mathrm{Next}_1$, and let $A$ be any subautomaton. The key property
is that if $p_1$ is the pass-1 value of $\theta_A$ then after a call to $\mathrm{Next}_1(A, p_1, a)$, the value of $A$ is correctly updated
to the pass-1 value. This follows by a straightforward induction similar to the exact case. Since the pass-1
value of $\theta$ after reading the $j$th prefix of $Q$ is $j$, the correctness of the call to $\mathrm{Next}_1$ follows. For $\mathrm{Next}_2$ the
result follows by an analogous argument.

Next we show how to efficiently represent $\mathrm{Next}^1_A$ and $\mathrm{Next}^2_A$. First consider $\mathrm{Next}^1_A$. Note that again the
alphabet size is a problem. Since the $B_1$ value of a node in $A$ depends on other $B_1$ values in $A$ we cannot
"split" the computation of $\mathrm{Next}^1_A$ as before. However, the alphabet character only affects the value of $\lambda(v, a)$,
which is $0$ if $v$ is an $a$-node and $1$ otherwise. Hence, we can obtain $\lambda(v, a)$ for all nodes in $A$ from $\mathrm{Eq}^A(a)$
from the previous section. Recall that $\mathrm{Eq}^A(a)$ can be represented for all subautomata in total $O(m)$ space.
With this representation the total number of possible inputs to $\mathrm{Next}^1_A$ can be represented using $(d+2)^z + 2^{2z}$
bits. Note that for $z = \frac{k}{\log(d+2)}$ we have that $(d+2)^z = 2^k$. Furthermore, since $\mathrm{Next}^1_A$ is now alphabet
independent we can apply the same trick as before and only precompute it for all possible parse trees with
leaf labels removed. It follows that we can choose $z = \Theta(\frac{k}{\log(d+2)})$ such that $\mathrm{Next}^1_A$ can be precomputed in total
$O(2^k)$ time and space. An analogous argument applies to $\mathrm{Next}^2_A$. Hence, by our discussion in the previous
sections we have shown that,

Theorem 13 For regular expression $R$ of length $m$, string $Q$ of length $n$, and integer $d \geq 0$, Approximate
Regular Expression Matching can be solved in $O(\frac{mn \log(d+2)}{k} + n + m \log m)$ time and $O(2^k + m)$ space,
for any $k \leq w$.

5.4 Subsequence Indexing 

The Subsequence Indexing problem is to preprocess a string $T$ to build a data structure supporting
queries of the form: "is $Q$ a subsequence of $T$?" for any string $Q$. This problem was considered by
Baeza-Yates [BY91] who showed the trade-offs listed in Table 5.1. We assume throughout the section that $T$
and $Q$ have length $n$ and $m$, respectively. For properties of automata accepting subsequences of strings and
generalizations of the problem see the recent survey [CMT03].

Using recent data structures and a few observations we improve all previous bounds. As a notational
shorthand, we will say that a data structure with preprocessing time and space $f(n, \sigma)$ and query time
$g(m, n, \sigma)$ has complexity $\langle f(n, \sigma), g(m, n, \sigma) \rangle$.

    Space               Preprocessing       Query
    $O(n\sigma)$        $O(n\sigma)$        $O(m)$
    $O(n \log \sigma)$  $O(n \log \sigma)$  $O(m \log \sigma)$
    $O(n)$              $O(n)$              $O(m \log n)$

Table 5.1: Trade-offs for Subsequence Indexing.

Let us consider the simplest algorithm for Subsequence Indexing. One can build a DFA of size $O(n\sigma)$
for recognizing all subsequences of $T$. To do so, create an accepting node for each character of $T$, and for
node $v_i$, corresponding to character $T[i]$, create an edge to $v_j$ on character $a$ if $T[j]$ is the first $a$ after
position $i$. The start node has edges to the first occurrence of each character. Such an automaton yields an
algorithm with complexity $\langle O(n\sigma), O(m) \rangle$.
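
The automaton just described can be sketched in a few lines of Python. This is only an illustration: the transition table is stored as one dictionary per node, and the names `build_subsequence_automaton` and `is_subsequence` are ours.

```python
def build_subsequence_automaton(T):
    """Return nxt, where nxt[i][a] is the first position j > i with T[j] == a.

    Positions are 1-based as in the text, and node 0 is the start node.
    Every node is accepting, so a query fails only on a missing edge.
    """
    n = len(T)
    nxt = [dict() for _ in range(n + 1)]
    first_after = {}                      # first occurrence of each character after position i
    for i in range(n, 0, -1):
        first_after[T[i - 1]] = i         # T[i] is now the first occurrence after position i-1
        nxt[i - 1] = dict(first_after)    # O(sigma) entries per node, O(n * sigma) in total
    return nxt


def is_subsequence(nxt, Q):
    """Follow one edge per character of Q, so a query takes O(|Q|) steps."""
    state = 0
    for a in Q:
        state = nxt[state].get(a)
        if state is None:
            return False
    return True
```

For example, with `nxt = build_subsequence_automaton("abracadabra")`, the query `is_subsequence(nxt, "bcd")` follows the edges to positions 2, 5, and 7.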

An alternative is to build, for each character $a$, a data structure $D_a$ with the positions of $a$ in $T$. $D_a$
should support fast successor queries. The $D_a$'s can all be built in a total of linear time and space using, for
instance, van Emde Boas trees and perfect hashing [vEB77, vEBKZ77, MN90]. These trees have query time
$O(\log \log n)$. We use these vEB trees to simulate the above automaton-based algorithm: whenever we are in
state $v_i$, and the next character to be read from $Q$ is $a$, we look up the successor of $i$ in $D_a$ in $O(\log \log n)$
time. The complexity of this algorithm is $\langle O(n), O(m \log \log n) \rangle$.
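
The successor-based simulation can be sketched as follows. As a loudly stated assumption, the sketch replaces the van Emde Boas trees by sorted position lists searched with `bisect`, so each successor query costs $O(\log n)$ rather than $O(\log \log n)$; only the structure of the algorithm is illustrated. The function names are ours.

```python
from bisect import bisect_right
from collections import defaultdict


def build_position_lists(T):
    """D[a] = sorted list of the (1-based) positions of character a in T."""
    D = defaultdict(list)
    for i, a in enumerate(T, start=1):
        D[a].append(i)                    # positions are appended in increasing order
    return dict(D)


def is_subsequence_by_successor(D, Q):
    """Simulate the automaton: from position i, jump to the successor of i in D[a]."""
    i = 0                                 # position 0 plays the role of the start node
    for a in Q:
        positions = D.get(a, [])
        k = bisect_right(positions, i)    # index of the first position strictly greater than i
        if k == len(positions):
            return False
        i = positions[k]
    return True
```

The total size of all lists is $n$, which matches the linear space bound; only the query time differs from the vEB-based structure.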

We combine these two data structures as follows: Consider an automaton consisting of nodes $u_1, \ldots, u_{n/\sigma}$,
where node $u_i$ corresponds to characters $T[\sigma(i-1), \ldots, \sigma i - 1]$, that is, each node $u_i$ corresponds to $\sigma$
consecutive characters of $T$. Within each such node, apply the vEB based data structure. Between such nodes, apply the full
automaton data structure. That is, for node $u_i$, compute the first occurrence of each character $a$ after
$T[\sigma i - 1]$. Call these long jumps. An edge takes you to a node $u_j$, and as many characters of $Q$ are consumed
within $u_j$ as possible. When no valid edge is possible within $u_j$, take a long jump. The automaton uses
$O(\frac{n}{\sigma} \cdot \sigma) = O(n)$ space and preprocessing time. The total size of the vEB data structures is $O(n)$. Since
each $u_i$ consists of at most $\sigma$ nodes, the query time is improved to $O(\log \log \sigma)$. Hence, the complexity of this
algorithm is $\langle O(n), O(m \log \log \sigma) \rangle$. To get a trade-off we can replace the vEB data structures by a recent
data structure of Thorup [Tho03, Thm. 2]. This data structure supports successor queries of $x$ integers in the
range $[1, X]$ using $O(x X^{1/2^l})$ preprocessing time and space with query time $O(l + 1)$, for $0 \leq l \leq \log \log X$.
Since each of the $n/\sigma$ groups of nodes contains at most $\sigma$ nodes, this implies the following result:

Theorem 14 Subsequence Indexing can be solved in $\langle O(n \sigma^{1/2^l}), O(m(l + 1)) \rangle$, for $0 \leq l \leq \log \log \sigma$.

Corollary 2 Subsequence Indexing can be solved in $\langle O(n \sigma^{\epsilon}), O(m) \rangle$ or $\langle O(n), O(m \log \log \sigma) \rangle$.

Proof. We set $l$ to be a constant or $\log \log \sigma$, respectively. □
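
The combination of long jumps between blocks and successor search within a block can be sketched as below. The sketch makes several assumptions that are not part of the result: positions are 0-based, the block length `sigma` is passed explicitly (defaulting to the number of distinct characters), and the within-block successor structure is a sorted list searched with `bisect` instead of a vEB tree or Thorup's structure. All names are hypothetical.

```python
from bisect import bisect_right


def build_blocked_index(T, sigma=None):
    """Split T into blocks of length sigma; store per-block lists and long jumps."""
    n = len(T)
    if sigma is None:
        sigma = max(1, len(set(T)))
    num_blocks = (n + sigma - 1) // sigma
    # within[b][a]: sorted positions of character a inside block b
    within = [dict() for _ in range(num_blocks)]
    for p, a in enumerate(T):
        within[p // sigma].setdefault(a, []).append(p)
    # jump[b][a]: first position of a strictly after block b (a "long jump")
    jump = [dict() for _ in range(num_blocks)]
    first = {}
    for b in range(num_blocks - 1, -1, -1):
        jump[b] = dict(first)             # at most sigma entries per block, O(n) in total
        for p in range(min(n, (b + 1) * sigma) - 1, b * sigma - 1, -1):
            first[T[p]] = p               # scanning right to left keeps the leftmost position
    return sigma, within, jump


def is_subsequence_blocked(index, Q):
    sigma, within, jump = index
    if not within:                        # empty text: only the empty query matches
        return len(Q) == 0
    pos = -1                              # position of the last matched character
    for a in Q:
        b = max(pos, 0) // sigma          # block containing the current position
        positions = within[b].get(a, [])
        k = bisect_right(positions, pos)
        if k < len(positions):
            pos = positions[k]            # next occurrence lies in the same block
        elif a in jump[b]:
            pos = jump[b][a]              # otherwise take a long jump
        else:
            return False
    return True
```

Because each within-block list has at most `sigma` entries, a successor structure over a universe of size $\sigma$ suffices, which is where the $\log \log \sigma$ (or $l + 1$) query factor in the bounds above comes from.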



5.5 String Edit Distance 

The String Edit Distance problem is to compute the minimum number of edit operations needed to
transform a string $S$ into a string $T$. Let $m$ and $n$ be the size of $S$ and $T$, respectively. The classical solution
to this problem, due to Wagner and Fischer [WF74], fills in the entries of an $(m+1) \times (n+1)$ matrix $D$. The
entry $D_{i,j}$ is the edit distance between $S[1..i]$ and $T[1..j]$, and can be computed using the following recursion:

\[
\begin{aligned}
D_{i,0} &= i, \\
D_{0,j} &= j, \\
D_{i,j} &= \min\{D_{i-1,j-1} + \lambda(i,j),\ D_{i-1,j} + 1,\ D_{i,j-1} + 1\},
\end{aligned}
\]

where $\lambda(i,j) = 0$ if $S[i] = T[j]$ and $1$ otherwise. The edit distance between $S$ and $T$ is the entry $D_{m,n}$. Using
dynamic programming the problem can be solved in $O(mn)$ time. When filling out the matrix we only need
to store the previous row or column and hence the space used is $O(\min(m, n))$. For further details, see the
book by Gusfield [Gus97, Chap. 11].
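
As a concrete reference point, the recursion with only one stored row can be sketched as follows. The function name `edit_distance` is ours, and the swap at the top is simply one way to realize the $O(\min(m, n))$ space bound.

```python
def edit_distance(S, T):
    """Edit distance via the recursion above, keeping only the previous row."""
    if len(T) > len(S):
        S, T = T, S                       # make T the shorter string so rows have min(m, n) + 1 entries
    prev = list(range(len(T) + 1))        # row 0: D[0][j] = j
    for i in range(1, len(S) + 1):
        cur = [i] + [0] * len(T)          # D[i][0] = i
        for j in range(1, len(T) + 1):
            lam = 0 if S[i - 1] == T[j - 1] else 1
            cur[j] = min(prev[j - 1] + lam,   # match or substitution
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1)      # insertion
        prev = cur
    return prev[len(T)]
```

The tabulation-based algorithms discussed next compute the same matrix, but fill it in one $x \times x$ cell at a time.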
The best algorithm for this problem, due to Masek and Paterson [MP80], uses $O(\frac{mn}{k^2} + m + n)$
time and $O(2^k + \min(m, n))$ space, for any $k \leq w$. This algorithm, however, assumes that the alphabet
size is constant. In this section we give an algorithm using $O(\frac{nm \log^2 k}{k^2} + m + n)$ time and $O(2^k + \min(m, n))$
space, for any $k \leq w$, that works for any alphabet. Hence, we remove the dependency on the alphabet at the
cost of a $\log^2 k$ factor.

We first describe the algorithm by Masek and Paterson [MP80], and then modify it to handle arbitrary
alphabets. The algorithm uses a Four Russians trick. The matrix $D$ is divided into cells of size $x \times x$ and
all possible inputs of a cell are then precomputed and stored in a table. From the above recursion it follows
that the values inside each cell $C$ depend on the corresponding substrings in $S$ and $T$, denoted $S_C$ and $T_C$,
and on the values in the top row and the leftmost column in $C$. The number of different strings of length
$x$ is $\sigma^x$ and hence there are $\sigma^{2x}$ possible choices for $S_C$ and $T_C$. Masek and Paterson [MP80] showed that
adjacent entries in $D$ differ by at most one, and therefore if we know the value of an entry there are exactly
three choices for each adjacent entry. Since there are at most $m$ different values for the top left corner of
a cell it follows that the number of different inputs for the top row and the leftmost column is $m 3^{2x}$. In
total, there are at most $m(\sigma 3)^{2x}$ different inputs to a cell. Assuming that the alphabet has constant size, we can
choose $x = \Theta(k)$ such that all cells can be precomputed in $O(2^k)$ time and space. The input of each cell is
stored in a single machine word and therefore all values in a cell can be computed in constant time. The
total number of cells in the matrix is $O(\frac{nm}{k^2})$ and hence this implies an algorithm using $O(\frac{nm}{k^2} + m + n)$ time
and $O(2^k + \min(m, n))$ space.

We show how to generalize this to arbitrary alphabets. The first observation, similar to the idea in
Section 5.3, is that the values inside a cell $C$ do not depend on the actual characters of $S_C$ and $T_C$, but
only on the $\lambda$ function on $S_C$ and $T_C$. Hence, we only need to encode whether or not $S_C[i] = T_C[j]$ for all
$1 \leq i, j \leq x$. To do this we assign a code $c(a)$ to each character $a$ that appears in $T_C$ or $S_C$ as follows. If
$a$ appears in only one of $S_C$ or $T_C$ then $c(a) = 0$. Otherwise, $c(a)$ is the rank of $a$ in the sorted list of
characters that appear in both $S_C$ and $T_C$. The representation is given by two vectors $\hat{S}_C$ and $\hat{T}_C$ of size
$x$, where $\hat{S}_C[i] = c(S_C[i])$ and $\hat{T}_C[i] = c(T_C[i])$, for all $i$, $1 \leq i \leq x$. Clearly, $S_C[i] = T_C[j]$ iff $\hat{S}_C[i] = \hat{T}_C[j]$
and $\hat{S}_C[i] > 0$ and $\hat{T}_C[j] > 0$, and hence $\hat{S}_C$ and $\hat{T}_C$ suffice to represent $\lambda$ on $C$.
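
The encoding can be sketched as follows. The function names `encode_cell` and `same_lambda` are ours, and strings stand in for the substrings of a cell; the second helper checks the equivalence stated above by brute force.

```python
def encode_cell(S_C, T_C):
    """Return the rank vectors for a cell: 0 for characters in only one substring,
    otherwise the 1-based rank among the characters occurring in both."""
    common = sorted(set(S_C) & set(T_C))
    code = {a: rank for rank, a in enumerate(common, start=1)}
    S_hat = [code.get(a, 0) for a in S_C]
    T_hat = [code.get(a, 0) for a in T_C]
    return S_hat, T_hat


def same_lambda(S_C, T_C):
    """Check that S_C[i] == T_C[j] iff the codes agree and are both positive."""
    S_hat, T_hat = encode_cell(S_C, T_C)
    return all(
        (S_C[i] == T_C[j]) == (S_hat[i] == T_hat[j] and S_hat[i] > 0 and T_hat[j] > 0)
        for i in range(len(S_C))
        for j in range(len(T_C))
    )
```

Two cells over entirely different alphabets can receive the same code vectors, for instance `("abca", "cxab")` and `("xyzx", "zqxy")`, which is exactly why a single precomputed table entry can serve both.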

The number of characters appearing in both $T_C$ and $S_C$ is at most $x$ and hence each entry of the vectors
is assigned an integer value in the range $[0, x]$. Thus, the total number of bits needed for both vectors is
$2x \lceil \log x + 1 \rceil$. Hence, we can choose $x = \Theta(\frac{k}{\log k})$ such that the vectors for a cell can be represented in a
single machine word. It follows that if all vectors have been precomputed we get an algorithm for String
Edit Distance using $O(\frac{nm \log^2 k}{k^2} + m + n)$ time and $O(2^k + \min(m, n))$ space.

Next we show how to compute the vectors efficiently. Given any cell $C$, we can identify the characters
appearing in both $S_C$ and $T_C$ by sorting $S_C$ and then for each index $i$ in $T_C$ use a binary search to see
if $T_C[i]$ appears in $S_C$. Next we sort the characters appearing in both substrings and insert their ranks
into the corresponding positions in $\hat{S}_C$ and $\hat{T}_C$. All other positions in the vectors are given the value 0.
This algorithm uses $O(x \log x)$ time for each cell. However, since the number of cells is $O(\frac{nm}{x^2})$ the total
time becomes $O(\frac{nm \log x}{x})$, which for our choice of $x$ is $\Omega(\frac{nm (\log k)^2}{k})$. To improve this we group the cells
into macro cells of $y \times y$ cells. We then compute the vector representation for each of these macro cells.
The vector representation for a cell $C$ is now the corresponding subvectors of the macro cell containing $C$.
Hence, each vector entry is now in the range $[0, \ldots, xy]$ and thus uses $\lceil \log(xy + 1) \rceil$ bits. Computing the
vector representation uses $O(xy \log(xy))$ time for each macro cell and since the number of macro cells is
$O(\frac{nm}{(xy)^2})$ the total time to compute it is $O(\frac{nm \log(xy)}{xy} + m + n)$. It follows that we can choose $y = k \log k$ and
$x = \Theta(\frac{k}{\log k})$ such that the vectors for a cell can be represented in a single word. Furthermore, with this choice
of $x$ and $y$ all vectors are computed in $O(\frac{nm \log k}{k^2} + m + n)$ time. Combined with the time used to compute
the distance we have shown:

Theorem 15 For strings $S$ and $T$ of length $m$ and $n$, respectively, String Edit Distance can be solved
in $O(\frac{nm \log^2 k}{k^2} + m + n)$ time and $O(2^k + \min(m, n))$ space.

Abstract

In this paper we revisit the classical regular expression matching problem, namely, given a regular
expression $R$ and a string $Q$, decide if $Q$ matches one of the strings specified by $R$. Let $m$ and $n$ be the
length of $R$ and $Q$, respectively. On a standard unit-cost RAM with word length $w \geq \log n$, we show
that the problem can be solved in $O(m)$ space with the following running times:

\[
\begin{cases}
O(n \frac{m \log w}{w} + m \log w) & \text{if } m > w, \\
O(n \log m + m \log m) & \text{if } \sqrt{w} < m \leq w, \\
O(\min(n + m^2, n \log m + m \log m)) & \text{if } m \leq \sqrt{w}.
\end{cases}
\]

This improves the best known time bound among algorithms using $O(m)$ space. Whenever $w \geq \log^2 n$
it improves all known time bounds regardless of how much space is used.

6.1 Introduction 

Regular expressions are a powerful and simple way to describe a set of strings. For this reason, they are often 
chosen as the input language for text processing applications. For instance, in the lexical analysis phase of 
compilers, regular expressions are often used to specify and distinguish tokens to be passed to the syntax 
analysis phase. Utilities such as Grep, the programming language Perl, and most modern text editors provide 
mechanisms for handling regular expressions. These applications all need to solve the classical Regular 
Expression Matching problem, namely, given a regular expression R and a string Q, decide if Q matches 
one of the strings specified by R. 

The standard textbook solution, proposed by Thompson [Tho68] in 1968, constructs a non-deterministic
finite automaton (NFA) accepting all strings matching $R$. Subsequently, a state-set simulation checks if
the NFA accepts $Q$. This leads to a simple $O(nm)$ time and $O(m)$ space algorithm, where $m$ and $n$ are
the number of symbols in $R$ and $Q$, respectively. The full details are reviewed later in Sec. 6.2 and can be
found in most textbooks on compilers (e.g. Aho et al. [ASU86]). Despite the importance of the problem,
it took 24 years before the $O(nm)$ time bound was improved by Myers [Mye92a] in 1992, who achieved
$O(\frac{nm}{\log n} + (n + m) \log n)$ time and $O(\frac{nm}{\log n})$ space. For most values of $m$ and $n$ this improves the $O(nm)$
algorithm by an $O(\log n)$ factor. Currently, this is the fastest known algorithm. Recently, Bille and
Farach-Colton [BFC05] showed how to reduce the space of Myers' solution to $O(n)$. Alternatively, they showed how
to achieve a speedup of $O(\log m)$ over Thompson's algorithm while using $O(m)$ space. These results are
all valid on a unit-cost RAM with $w$-bit words and a standard instruction set including addition, bitwise
boolean operations, shifts, and multiplication. Each word is capable of holding a character of $Q$ and hence
$w \geq \log n$. The space complexities refer to the number of words used by the algorithm, not counting the
input which is assumed to be read-only. All results presented here assume the same model. In this paper we
present new algorithms achieving the following complexities:



Theorem 16 Given a regular expression $R$ and a string $Q$ of lengths $m$ and $n$, respectively, Regular
Expression Matching can be solved using $O(m)$ space with the following running times:

\[
\begin{cases}
O(n \frac{m \log w}{w} + m \log w) & \text{if } m > w, \\
O(n \log m + m \log m) & \text{if } \sqrt{w} < m \leq w, \\
O(\min(n + m^2, n \log m + m \log m)) & \text{if } m \leq \sqrt{w}.
\end{cases}
\]

This represents the best known time bound among algorithms using $O(m)$ space. To compare these with
previous results, consider a conservative word length of $w = \log n$. When the regular expression is "large",
e.g., $m > \log n$, we achieve an $O(\frac{\log n}{\log \log n})$ factor speedup over Thompson's algorithm using $O(m)$ space.
Hence, we simultaneously match the best known time and space bounds for the problem, with the exception
of an $O(\log \log n)$ factor in time. More interestingly, consider the case when the regular expression is
"small", e.g., $m = O(\log n)$. This is usually the case in most applications. To beat the $O(n \log n)$ time of
Thompson's algorithm, the fast algorithms [Mye92a, BFC05] essentially convert the NFA mentioned above
into a deterministic finite automaton (DFA) and then simulate this instead. Constructing and storing the
DFA incurs an additional exponential time and space cost in $m$, i.e., $O(2^m) = O(n)$ (see [WM92b, NR04] for
compact DFA representations). However, the DFA can now be simulated in $O(n)$ time, leading to an $O(n)$
time and space algorithm. Surprisingly, our result shows that this exponential blow-up in $m$ can be avoided
with very little loss of efficiency. More precisely, we get an algorithm using $O(n \log \log n)$ time and $O(\log n)$
space. Hence, the space is improved exponentially at the cost of an $O(\log \log n)$ factor in time. In the case
of an even smaller regular expression, e.g., $m = O(\sqrt{\log n})$, the slowdown can be eliminated and we achieve
optimal $O(n)$ time. For larger word lengths our time bounds improve. In particular, when $w > \log n \log \log n$
the bound is better in all cases, except for $\sqrt{w} \leq m \leq w$, and when $w > \log^2 n$ it improves all known time
bounds regardless of how much space is used.

The key to obtaining our results is to avoid explicitly converting small NFAs into DFAs. Instead we show
how to effectively simulate them directly using the parallelism available at the word-level of the machine
model. This kind of idea is not new and has been applied to many other string matching problems, most
famously, the Shift-Or algorithm [BYG92], and the approximate string matching algorithm by Myers [Mye99].
However, none of these algorithms can be easily extended to Regular Expression Matching. The main
problem is the complicated dependencies between states in an NFA. Intuitively, a state may have long
paths of $\epsilon$-transitions to a large number of other states, all of which have to be traversed in parallel in the
state-set simulation. To overcome this problem we develop several new techniques ultimately leading to
Theorem 16. For instance, we introduce a new hierarchical decomposition of NFAs suitable for a parallel
state-set simulation. We also show how state-set simulations of large NFAs efficiently reduce to simulating
small NFAs.

The results presented in this paper are primarily of theoretical interest. However, we believe that most of
the ideas are useful in practice. The previous algorithms require large tables for storing DFAs, and perform
a long series of lookups in these tables. As the tables become large we can expect a high number of cache
misses during the lookups, thus limiting the speedup in practice. Since we avoid these tables, our algorithms
do not suffer from this defect.

The paper is organized as follows. In Sec. 6.2 we review Thompson's NFA construction, and in Sec. 6.3 
we present the above mentioned reduction. In Sec. 6.4 we present our first simple algorithm for the problem 
which is then improved in Sec. 6.5. Combining these algorithms with our reduction leads to Theorem 16. 
We conclude with a couple of remarks and open problems in Sec. 6.6. 

6.2 Regular Expressions and Finite Automata 

In this section we briefly review Thompson's construction and the standard state-set simulation. The set of
regular expressions over an alphabet $\Sigma$ is defined recursively as follows:

• A character $a \in \Sigma$ is a regular expression.

• If $S$ and $T$ are regular expressions then so is the concatenation, $(S) \cdot (T)$, the union, $(S)|(T)$, and the
star, $(S)^*$.

Unnecessary parentheses can be removed by observing that $\cdot$ and $|$ are associative and by using the standard
precedence of the operators, that is $*$ precedes $\cdot$, which in turn precedes $|$. We often remove the $\cdot$ when
writing regular expressions.

Figure 6.1: Thompson's NFA construction. The regular expression for a character $a \in \Sigma$ corresponds to
NFA (a). If $S$ and $T$ are regular expressions then $N(ST)$, $N(S|T)$, and $N(S^*)$ correspond to NFAs (b), (c),
and (d), respectively. Accepting nodes are marked with a double circle.

The language $L(R)$ generated by $R$ is the set of all strings matching $R$. The parse tree $T(R)$ of $R$ is the
binary rooted tree representing the hierarchical structure of $R$. Each leaf is labeled by a character in $\Sigma$ and
each internal node is labeled either $\cdot$, $|$, or $*$. A finite automaton is a tuple $A = (V, E, \delta, \theta, \phi)$, where

• $V$ is a set of nodes called states,

• $E$ is a set of directed edges between states called transitions,

• $\delta : E \rightarrow \Sigma \cup \{\epsilon\}$ is a function assigning labels to transitions, and

• $\theta, \phi \in V$ are distinguished states called the start state and accepting state, respectively^1.

Intuitively, $A$ is an edge-labeled directed graph with special start and accepting nodes. $A$ is a deterministic
finite automaton (DFA) if $A$ does not contain any $\epsilon$-transitions, and all outgoing transitions of any state have
different labels. Otherwise, $A$ is a non-deterministic automaton (NFA). We say that $A$ accepts a string $Q$ if
there is a path from $\theta$ to $\phi$ such that the concatenation of labels on the path spells out $Q$. Thompson [Tho68]
showed how to recursively construct an NFA $N(R)$ accepting all strings in $L(R)$. The rules are presented
below and illustrated in Fig. 6.1.

• $N(a)$ is the automaton consisting of states $\theta_a$, $\phi_a$, and an $a$-transition from $\theta_a$ to $\phi_a$.

• Let $N(S)$ and $N(T)$ be automata for regular expressions $S$ and $T$ with start and accepting states $\theta_S$,
$\theta_T$, $\phi_S$, and $\phi_T$, respectively. Then, NFAs $N(S \cdot T)$, $N(S|T)$, and $N(S^*)$ are constructed as follows:

$N(ST)$: Add start state $\theta_{ST}$ and accepting state $\phi_{ST}$, and $\epsilon$-transitions $(\theta_{ST}, \theta_S)$, $(\phi_S, \theta_T)$, and
$(\phi_T, \phi_{ST})$.

$N(S|T)$: Add start state $\theta_{S|T}$ and accepting state $\phi_{S|T}$, and add $\epsilon$-transitions $(\theta_{S|T}, \theta_S)$, $(\theta_{S|T}, \theta_T)$,
$(\phi_S, \phi_{S|T})$, and $(\phi_T, \phi_{S|T})$.

$N(S^*)$: Add a new start state $\theta_{S^*}$ and accepting state $\phi_{S^*}$, and $\epsilon$-transitions $(\theta_{S^*}, \theta_S)$, $(\theta_{S^*}, \phi_{S^*})$,
$(\phi_S, \phi_{S^*})$, and $(\phi_S, \theta_S)$.

$^1$ Sometimes NFAs are allowed a set of accepting states, but this is not necessary for our purposes.

Readers familiar with Thompson's construction will notice that $N(ST)$ is slightly different from the usual
construction. This is done to simplify our later presentation and does not affect the worst case complexity of
the problem. Any automaton produced by these rules we call a Thompson-NFA (TNFA). By construction,
$N(R)$ has a single start and accepting state, denoted $\theta$ and $\phi$, respectively. $\theta$ has no incoming transitions and
$\phi$ has no outgoing transitions. The total number of states is $2m$ and since each state has at most 2 outgoing
transitions, the total number of transitions is at most $4m$. Furthermore, all incoming transitions of a state have
the same label, and we call a state with incoming $a$-transitions an $a$-state. Note that the star construction
in Fig. 6.1(d) introduces a transition from the accepting state of $N(S)$ to the start state of $N(S)$. All such
transitions are called back transitions and all other transitions are forward transitions. We need the following
property.

Lemma 34 (Myers [Mye92a]) Any cycle-free path in a TNFA contains at most one back transition. 

For a string $Q$ of length $n$ the standard state-set simulation of $N(R)$ on $Q$ produces a sequence of state-sets
$S_0, \ldots, S_n$. The $i$th set $S_i$, $0 \le i \le n$, consists of all states in $N(R)$ for which there is a path from $\theta$ that
spells out the $i$th prefix of $Q$. The simulation can be implemented with the following simple operations. For
a state-set $S$ and a character $a \in \Sigma$, define

$\mathrm{Move}(S, a)$: Return the set of states reachable from $S$ via a single $a$-transition.
$\mathrm{Close}(S)$: Return the set of states reachable from $S$ via 0 or more $\epsilon$-transitions.

Since the number of states and transitions in $N(R)$ is $O(m)$, both operations can be easily implemented in
$O(m)$ time. The Close operation is often called an $\epsilon$-closure. The simulation proceeds as follows: Initially,
$S_0 := \mathrm{Close}(\{\theta\})$. If $Q[j] = a$, $1 \le j \le n$, then $S_j := \mathrm{Close}(\mathrm{Move}(S_{j-1}, a))$. Finally, $Q \in L(R)$ iff $\phi \in S_n$.
Since each state-set $S_j$ only depends on $S_{j-1}$ this algorithm uses $O(mn)$ time and $O(m)$ space.
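The construction rules and the state-set simulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parse tree is given directly as nested tuples (`("cat", S, T)`, `("alt", S, T)`, `("star", S)`, or a single character), and the names `build_tnfa`, `move`, `close`, and `matches` are our own and not part of the text.

```python
from collections import defaultdict
from itertools import count


def build_tnfa(tree):
    """Build a TNFA from a parse tree using the rules N(a), N(ST), N(S|T), N(S*).

    Returns (start, accept, eps, sym): eps maps a state to its epsilon
    successors, sym maps a state to its single (character, successor) pair.
    """
    ids = count()
    eps = defaultdict(list)
    sym = {}

    def go(t):
        theta, phi = next(ids), next(ids)  # two new states per parse tree node
        if isinstance(t, str):             # N(a): one a-transition
            sym[theta] = (t, phi)
            return theta, phi
        s_start, s_acc = go(t[1])
        if t[0] == "star":
            eps[theta] += [s_start, phi]
            eps[s_acc] += [phi, s_start]   # (s_acc, s_start) is the back transition
        elif t[0] == "cat":
            t_start, t_acc = go(t[2])
            eps[theta].append(s_start)
            eps[s_acc].append(t_start)
            eps[t_acc].append(phi)
        elif t[0] == "alt":
            t_start, t_acc = go(t[2])
            eps[theta] += [s_start, t_start]
            eps[s_acc].append(phi)
            eps[t_acc].append(phi)
        else:
            raise ValueError(f"unknown operator {t[0]!r}")
        return theta, phi

    start, accept = go(tree)
    return start, accept, eps, sym


def move(sym, S, a):
    """States reachable from S via a single a-transition."""
    return {sym[u][1] for u in S if u in sym and sym[u][0] == a}


def close(eps, S):
    """States reachable from S via 0 or more epsilon-transitions."""
    result, stack = set(S), list(S)
    while stack:
        for v in eps.get(stack.pop(), ()):
            if v not in result:
                result.add(v)
                stack.append(v)
    return result


def matches(tree, Q):
    """Standard state-set simulation: S_0 = Close({theta}), S_j = Close(Move(S_{j-1}, Q[j]))."""
    start, accept, eps, sym = build_tnfa(tree)
    S = close(eps, {start})
    for a in Q:
        S = close(eps, move(sym, S, a))
    return accept in S


# Parse tree of the regular expression ac|a*b used in Fig. 6.2.
EXAMPLE = ("alt", ("cat", "a", "c"), ("cat", ("star", "a"), "b"))
```

For instance, `matches(EXAMPLE, "aaab")` follows the star loop three times before reading `b`, while `matches(EXAMPLE, "a")` ends in a state-set that does not contain the accepting state.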

6.3 From Large to Small TNFAs 

In this section we show how to simulate N(R) by simulating a number of smaller TNFAs. We will use this 
to achieve our bounds when R is large. 

6.3.1 Clustering Parse Trees and Decomposing TNFAs 

Let R be a regular expression of length m. We first show how to decompose N(R) into smaller TNFAs. 
This decomposition is based on a simple clustering of the parse tree T(R). A cluster C is a connected 
subgraph of $T(R)$ and a cluster partition $CS$ is a partition of the nodes of $T(R)$ into node-disjoint clusters.
Since $T(R)$ is a binary tree with $O(m)$ nodes, a simple top-down procedure provides the following result (see
e.g. [Mye92a]): 

Lemma 35 Given a regular expression $R$ of length $m$ and a parameter $x$, a cluster partition $CS$ of $T(R)$
can be constructed in $O(m)$ time such that $|CS| = O(\lceil m/x \rceil)$, and for any $C \in CS$, the number of nodes in
$C$ is at most $x$.
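To make the statement of Lemma 35 concrete, the following sketch computes one such cluster partition. It is not the top-down procedure referred to above; it is a simple bottom-up greedy rule chosen for brevity (an assumption of this illustration), and the function name `cluster_partition` and the tuple encoding of parse trees are hypothetical. Whenever the open part below a node would exceed $x$ nodes, the largest open child part is closed off as its own cluster; every closed part then has at least roughly $x/2$ nodes, which bounds the number of clusters by $O(\lceil m/x \rceil)$.

```python
def cluster_partition(tree, x):
    """Partition the nodes of a binary parse tree into connected clusters of size <= x.

    tree: a single character (leaf) or a tuple (op, child, ...) with 1 or 2 children.
    Nodes are numbered in preorder. Returns label, where label[v] is the cluster
    index of node v.
    """
    parent, is_root = [], []

    def go(t, par):
        me = len(parent)                       # preorder number of this node
        parent.append(par)
        is_root.append(par is None)
        kids = [] if isinstance(t, str) else [go(c, me) for c in t[1:]]
        kids.sort(key=lambda k: k[1])          # smallest open part first
        size = 1 + sum(s for _, s in kids)
        while size > x and kids:               # close off the largest open child part
            child, s = kids.pop()
            is_root[child] = True
            size -= s
        return me, size

    go(tree, None)

    # Preorder numbering guarantees that a parent is labeled before its children.
    label, n_clusters = [], 0
    for v in range(len(parent)):
        if is_root[v]:
            label.append(n_clusters)
            n_clusters += 1
        else:
            label.append(label[parent[v]])
    return label


# Parse tree of ac|a*b; preorder: 0 '|', 1 '.', 2 a, 3 c, 4 '.', 5 '*', 6 a, 7 b.
TREE = ("alt", ("cat", "a", "c"), ("cat", ("star", "a"), "b"))
LABEL = cluster_partition(TREE, 3)
```

On this input the three clusters are the root together with the right concatenation node and its leaf `b`, the subtree for `ac`, and the subtree for `a*`. This particular partition need not coincide with the one drawn in Fig. 6.2(b).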

For a cluster partition CS, edges adjacent to two clusters are external edges and all other edges are internal 
edges. Contracting all internal edges in CS induces a macro tree, where each cluster is represented by a 
single macro node. Let $C_v$ and $C_w$ be two clusters with corresponding macro nodes $v$ and $w$. We say that
$C_v$ is the parent cluster (resp. child cluster) of $C_w$ if $v$ is the parent (resp. child) of $w$ in the macro tree.
The root cluster and leaf clusters are the clusters corresponding to the root and the leaves of the macro tree.
An example clustering of a parse tree is shown in Fig. 6.2(b). Given a cluster partition $CS$ of $T(R)$ we show
how to divide $N(R)$ into a set of small nested TNFAs. Each cluster $C \in CS$ will correspond to a TNFA $A$,
and we use the terms child, parent, root, and leaf for the TNFAs in the same way we do with clusters. For
a cluster $C \in CS$ with children $C_1, \ldots, C_l$, insert a special pseudo-node $p_i$, $1 \le i \le l$, in the middle of the
external edge connecting $C$ with $C_i$. We label each pseudo-node by a special character $\beta \notin \Sigma$. Let $T_C$ be
the tree induced by the set of nodes in $C$ and $\{p_1, \ldots, p_l\}$. Each leaf in $T_C$ is labeled with a character from
$\Sigma \cup \{\beta\}$, and hence $T_C$ is a well-formed parse tree for some regular expression $R_C$ over $\Sigma \cup \{\beta\}$. Now, the
TNFA $A$ corresponding to $C$ is $N(R_C)$. In $A$, each child TNFA $A_i$ is represented by its start and accepting states
$\theta_{A_i}$ and $\phi_{A_i}$ and a pseudo-transition labeled $\beta$ connecting them. An example of these definitions is given in
Fig. 6.2. We call any set of TNFAs obtained from a cluster partition as above a nested decomposition $AS$ of
$N(R)$.

Figure 6.2: (a) The parse tree for the regular expression $ac|a^*b$. (b) A clustering of (a) into node-disjoint
connected subtrees $C_1$, $C_2$, and $C_3$, each with at most 3 nodes. (c) The clustering from (b) extended with
pseudo-nodes. (d) The nested decomposition of $N(ac|a^*b)$. (e) The TNFA corresponding to $C_1$.

Lemma 36 Given a regular expression $R$ of length $m$ and a parameter $x$, a nested decomposition $AS$ of
$N(R)$ can be constructed in $O(m)$ time such that $|AS| = O(\lceil m/x \rceil)$, and for any $A \in AS$, the number of
states in $A$ is at most $x$.

Proof. Construct the parse tree $T(R)$ for $R$ and build a cluster partition $CS$ according to Lemma 35
with parameter $y = \frac{x}{4} - \frac{1}{2}$. From $CS$ build a nested decomposition $AS$ as described above. Each $C \in CS$
corresponds to a TNFA $A \in AS$ and hence $|AS| = O(\lceil m/y \rceil) = O(\lceil m/x \rceil)$. Furthermore, if $|V(C)| \le y$ we
have $|V(T_C)| \le 2y + 1$. Each node in $T_C$ contributes two states to the corresponding TNFA $A$, and hence
the total number of states in $A$ is at most $4y + 2 = x$. Since the parse tree, the cluster partition, and the
nested decomposition can be constructed in $O(m)$ time the result follows. □



6.3.2 Simulating Large Automata 

We now show how $N(R)$ can be simulated using the TNFAs in a nested decomposition. For this purpose we
define a simple data structure to dynamically maintain the TNFAs. Let $AS$ be a nested decomposition of
$N(R)$ according to Lemma 36, for some parameter $x$. Let $A \in AS$ be a TNFA, let $S_A$ be a state-set of $A$,
let $s$ be a state in $A$, and let $a \in \Sigma$. A simulation data structure supports the 4 operations: $\mathrm{Move}_A(S_A, a)$,
$\mathrm{Close}_A(S_A)$, $\mathrm{Member}_A(S_A, s)$, and $\mathrm{Insert}_A(S_A, s)$. Here, the operations $\mathrm{Move}_A$ and $\mathrm{Close}_A$ are defined exactly
as in Sec. 6.2, with the modification that they only work on $A$ and not $N(R)$. The operation $\mathrm{Member}_A(S_A, s)$
returns yes if $s \in S_A$ and no otherwise, and $\mathrm{Insert}_A(S_A, s)$ returns the set $S_A \cup \{s\}$.

In the following sections we consider various efficient implementations of simulation data structures. For
now assume that we have a black-box data structure for each $A \in AS$. To simulate $N(R)$ we proceed as
follows. First, fix an ordering of the TNFAs in the nested decomposition $AS$, e.g., by a preorder traversal
of the tree given by the parent/child relationship of the TNFAs. The collection of state-sets for
each TNFA in $AS$ is represented in a state-set array $X$ of length $|AS|$. The state-set array is indexed by
the above numbering, that is, $X[i]$ is the state-set of the $i$th TNFA in $AS$. For notational convenience we
write $X[A]$ to denote the entry in $X$ corresponding to $A$. Note that a parent TNFA shares two states with
each child, and therefore a state may be represented more than once in $X$. To avoid complications we will
always ensure that $X$ is consistent, meaning that if a state $s$ is included in the state-set of some TNFA, then
it is also included in the state-sets of all other TNFAs that share $s$. If $S = \bigcup_{A \in AS} X[A]$ we say that $X$
models the state-set $S$ and write $S \equiv X$.

Next we show how to do a state-set simulation of $N(R)$ using the operations $\mathrm{Move}_{AS}$ and $\mathrm{Close}_{AS}$, which
we define below. These operations recursively update a state-set array using the simulation data structures.
For any $A \in AS$, state-set array $X$, and $a \in \Sigma$ define

$\mathrm{Move}_{AS}(A, X, a)$:

1. $X[A] := \mathrm{Move}_A(X[A], a)$

2. For each child $A_i$ of $A$ in topological order do

(a) $X := \mathrm{Move}_{AS}(A_i, X, a)$

(b) If $\phi_{A_i} \in X[A_i]$ then $X[A] := \mathrm{Insert}_A(X[A], \phi_{A_i})$

3. Return $X$

$\mathrm{Close}_{AS}(A, X)$:

1. $X[A] := \mathrm{Close}_A(X[A])$

2. For each child $A_i$ of $A$ in topological order do

(a) If $\theta_{A_i} \in X[A]$ then $X[A_i] := \mathrm{Insert}_{A_i}(X[A_i], \theta_{A_i})$

(b) $X := \mathrm{Close}_{AS}(A_i, X)$

(c) If $\phi_{A_i} \in X[A_i]$ then $X[A] := \mathrm{Insert}_A(X[A], \phi_{A_i})$

(d) $X[A] := \mathrm{Close}_A(X[A])$

3. Return $X$

The $\mathrm{Move}_{AS}$ and $\mathrm{Close}_{AS}$ operations recursively traverse the nested decomposition top-down, processing
the children in topological order. At each child the shared start and accepting states are propagated in the
state-set array. For simplicity, we have written $\mathrm{Member}_A$ using the symbol $\in$.

The state-set simulation of $N(R)$ on a string $Q$ of length $n$ produces the sequence of state-set arrays
$X_0, \ldots, X_n$ as follows: Let $A_r$ be the root automaton and let $X$ be an empty state-set array (all entries in
$X$ are $\emptyset$). Initially, set $X[A_r] := \mathrm{Insert}_{A_r}(X[A_r], \theta_{A_r})$ and compute $X_0 := \mathrm{Close}_{AS}(A_r, \mathrm{Close}_{AS}(A_r, X))$. For
$i > 0$ we compute $X_i$ from $X_{i-1}$ as follows:

$X_i := \mathrm{Close}_{AS}(A_r, \mathrm{Close}_{AS}(A_r, \mathrm{Move}_{AS}(A_r, X_{i-1}, Q[i])))$

Finally, we output $Q \in L(R)$ iff $\phi_{A_r} \in X_n[A_r]$. To see that this algorithm correctly solves Regular
Expression Matching it suffices to show that for any $i$, $0 \le i \le n$, $X_i$ correctly models the $i$th state-set
$S_i$ in the standard state-set simulation. We need the following lemma.

Lemma 37 Let $X$ be a state-set array and let $A_r$ be the root TNFA in a nested decomposition $AS$. If $S$ is
the state-set modeled by $X$, then

• $\mathrm{Move}(S, a) \equiv \mathrm{Move}_{AS}(A_r, X, a)$ and

• $\mathrm{Close}(S) \equiv \mathrm{Close}_{AS}(A_r, \mathrm{Close}_{AS}(A_r, X))$.



Proof. First consider the $\mathrm{Move}_{AS}$ operation. Let $\bar{A}$ be the TNFA induced by all states in $A$ and descendants
of $A$ in the nested decomposition, i.e., $\bar{A}$ is obtained by recursively "unfolding" the pseudo-states and
pseudo-transitions in $A$, replacing them by the TNFAs they represent. We show by induction that the
state-set array $X'_A := \mathrm{Move}_{AS}(A, X, a)$ models $\mathrm{Move}(S, a)$ on $\bar{A}$. In particular, plugging in $A = A_r$, we have
that $\mathrm{Move}_{AS}(A_r, X, a)$ models $\mathrm{Move}(S, a)$ as required.

Initially, line 1 updates $X[A]$ to be the set of states reachable via a single $a$-transition in $A$. If $A$ is
a leaf, line 2 is completely bypassed and the result follows immediately. Otherwise, let $A_1, \ldots, A_l$ be the
children of $A$ in topological order. Any incoming transition to a state $\theta_{A_i}$ or outgoing transition from a
state $\phi_{A_i}$ is an $\epsilon$-transition by Thompson's construction. Hence, no endpoint of an $a$-transition in $A$ can
be shared with any of the children $A_1, \ldots, A_l$. It follows that after line 1 the updated $X[A]$ is the desired
state-set, except for the shared states, which have not been handled yet. By induction, the recursive calls
in line 2(a) handle the children. Among the shared states only the accepting ones, $\phi_{A_1}, \ldots, \phi_{A_l}$, may be the
endpoint of an $a$-transition and therefore line 2(b) computes the correct state-set.

The $\mathrm{Close}_{AS}$ operation proceeds in a similar, though slightly more complicated fashion. Let $\hat{X}_A$ be the
state-set array modeling the set of states reachable via a path of forward $\epsilon$-transitions in $\bar{A}$, and let $\tilde{X}_A$ be the
state-set array modeling $\mathrm{Close}(S)$ in $\bar{A}$. We show by induction that if $X''_A := \mathrm{Close}_{AS}(A, X)$ then

$\hat{X}_A \subseteq X''_A \subseteq \tilde{X}_A$,

where the inclusion refers to the underlying state-sets modeled by the state-set arrays. Initially, line 1 updates
$X[A] := \mathrm{Close}_A(X[A])$. If $A$ is a leaf then clearly $X''_A = \tilde{X}_A$. Otherwise, let $A_1, \ldots, A_l$ be the children of $A$
in topological order. Line 2 recursively updates the children and propagates the start and accepting states in
(a) and (c). Following each recursive call we again update $X[A] := \mathrm{Close}_A(X[A])$ in (d). No state is included
in $X''_A$ if there is no $\epsilon$-path in $A$ or through any child of $A$. Furthermore, since the children are processed in
topological order it is straightforward to verify that the sequence of updates in line 2 ensures that $X''_A$ contains
all states reachable via a path of forward $\epsilon$-transitions in $A$ or through a child of $A$. Hence, by induction we
have $\hat{X}_A \subseteq X''_A \subseteq \tilde{X}_A$ as desired.

A similar induction shows that the state-set array $\mathrm{Close}_{AS}(A_r, X'')$ models the set of states reachable from
$X''$ using a path consisting of forward $\epsilon$-transitions and at most 1 back transition. However, by Lemma 34
this is exactly the set of states reachable by a path of $\epsilon$-transitions. Hence, $\mathrm{Close}_{AS}(A_r, X'')$ models $\mathrm{Close}(S)$
and the result follows. □



By Lemma 37 the state-set simulation can be done using the $\mathrm{Close}_{AS}$ and $\mathrm{Move}_{AS}$ operations and the
complexity now directly depends on the complexities of the simulation data structure. Putting it all together
the following reduction easily follows:

Lemma 38 Let $R$ be a regular expression of length $m$ over alphabet $\Sigma$ and let $Q$ be a string of length $n$. Given
a simulation data structure for TNFAs with $x < m$ states over alphabet $\Sigma \cup \{\beta\}$, where $\beta \notin \Sigma$, that supports
all operations in $O(t(x))$ time, using $O(s(x))$ space, and $O(p(x))$ preprocessing time, Regular Expression
Matching for $R$ and $Q$ can be solved in $O\left(\frac{nm \cdot t(x)}{x} + \frac{m \cdot p(x)}{x}\right)$ time using $O\left(\frac{m \cdot s(x)}{x}\right)$ space.

Proof. Given $R$ first compute a nested decomposition $AS$ of $N(R)$ using Lemma 36 for parameter $x$. For
each TNFA $A \in AS$ sort $A$'s children topologically and keep pointers to start and accepting states. By
Lemma 36 and since topological sort can be done in $O(m)$ time this step uses $O(m)$ time. The total space
to represent the decomposition is $O(m)$. Each $A \in AS$ is a TNFA over the alphabet $\Sigma \cup \{\beta\}$ with at most
$x$ states and $|AS| = O(\lceil m/x \rceil)$. Hence, constructing simulation data structures for all $A \in AS$ uses $O\left(\frac{m \cdot p(x)}{x}\right)$
time and $O\left(\frac{m \cdot s(x)}{x}\right)$ space. With the above algorithm the state-set simulation of $N(R)$ can now be done in
$O\left(\frac{nm \cdot t(x)}{x}\right)$ time, yielding the desired complexity. □

The idea of decomposing TNFAs is also present in Myers' paper [Mye92a], though he does not give a
"black-box" reduction as in Lemma 38. We believe that the framework provided by Lemma 38 helps to
simplify the presentation of the algorithms significantly. We can restate Myers' result in our setting as the
existence of a simulation data structure with $O(1)$ query time that uses $O(x \cdot 2^x)$ space and preprocessing
time. For $x \le \log(n/\log n)$ this achieves the result mentioned in the introduction. The key idea is to encode
and tabulate the results of all queries (such an approach is frequently referred to as the "Four Russian
Technique" [ADKF70]). Bille and Farach-Colton [BFC05] give a more space-efficient encoding that does not use
Lemma 38 as above. Instead they show how to encode all possible simulation data structures in total
$O(2^x + m)$ time and space while maintaining $O(1)$ query time.

In the following sections we show how to efficiently avoid the large tables needed in the previous
approaches. Instead we implement the operations of simulation data structures using the word-level parallelism
of the machine model.

6.4 A Simple Algorithm 

In this section we present a simple simulation data structure for TNFAs, and develop some of the ideas for
the improved result of the next section. Let $A$ be a TNFA with $m = O(\sqrt{w})$ states. We will show how to
support all operations in $O(1)$ time using $O(m)$ space and $O(m^2)$ preprocessing time.

To build our simulation data structure for $A$, first sort all states in $A$ in topological order ignoring the
back transitions. We require that the endpoints of an $a$-transition are consecutive in this order. This is
automatically guaranteed using a standard $O(m)$ time algorithm for topological sorting (see e.g. [CLRS01]).
We will refer to states in $A$ by their rank in this order. A state-set of $A$ is represented using a bitstring
$S = s_1 s_2 \ldots s_m$ defined such that $s_i = 1$ iff state $i$ is in the state-set. The simulation data structure consists
of the following bitstrings:

• For each $a \in \Sigma$, a string $D_a = d_1 \ldots d_m$ such that $d_i = 1$ iff $i$ is an $a$-state.

• A string $E = 0 e_{1,1} e_{1,2} \ldots e_{1,m} \, 0 e_{2,1} e_{2,2} \ldots e_{2,m} \ldots 0 e_{m,1} e_{m,2} \ldots e_{m,m}$, where $e_{i,j} = 1$ iff $i$ is $\epsilon$-reachable
from $j$. The zeros are test bits needed for the algorithm.

• Three constants $I = (10^m)^m$, $X = 1(0^m 1)^{m-1}$, and $C = 1(0^{m-1} 1)^{m-1}$. Note that $I$ has a 1 in each
test bit position$^2$.

The strings $E$, $I$, $X$, and $C$ are easily computed in $O(m^2)$ time and use $O(m^2)$ bits. Since $m = O(\sqrt{w})$
only $O(1)$ space is needed to store these strings. We store $D_a$ in a hashtable indexed by $a$. Since the total
number of different characters in $A$ can be at most $m$, the hashtable contains at most $m$ entries. Using
perfect hashing $D_a$ can be represented in $O(m)$ space with $O(1)$ worst-case lookup time. The preprocessing
time is expected $O(m)$ w.h.p. To get a worst-case bound we use the deterministic dictionary of Hagerup
et al. [HMP01] with $O(m \log m)$ worst-case preprocessing time. In total the data structure requires $O(m)$
space and $O(m^2)$ preprocessing time.

Next we show how to support each of the operations on $A$. Suppose $S = s_1 \ldots s_m$ is a bitstring
representing a state-set of $A$ and $a \in \Sigma$. The result of $\mathrm{Move}_A(S, a)$ is given by

$S' := (S \gg 1) \mathbin{\&} D_a$.

This should be understood as C notation, where the right-shift is unsigned. Readers familiar with the Shift-
Or algorithm [BYG92] will notice the similarity. To see the correctness, observe that state $i$ is put in $S'$
iff state $(i - 1)$ is in $S$ and the $i$th state is an $a$-state. Since the endpoints of $a$-transitions are consecutive
in the topological order it follows that $S'$ is correct. Here, state $(i - 1)$ can only influence state $i$, and this
makes the operation easy to implement in parallel. However, this is not the case for $\mathrm{Close}_A$. Here, any state
can potentially affect a large number of states reachable through long $\epsilon$-paths. To deal with this we use the
following steps.

$Y := (S \times X) \mathbin{\&} E$

$Z := ((Y \mid I) - (I \gg m)) \mathbin{\&} I$

$S' := ((Z \times C) \ll (w - m(m+1))) \gg (w - m)$

$^2$ We use exponentiation to denote repetition, i.e., $1^3 0 = 1110$.

We describe in detail why this, at first glance somewhat cryptic sequence, correctly computes $S'$ as the result
of $\mathrm{Close}_A(S)$. The variables $Y$ and $Z$ are simply temporary variables inserted to increase the readability of
the computation. Let $S = s_1 \ldots s_m$. Initially, $S \times X$ concatenates $m$ copies of $S$ with a zero bit between
each copy, that is,

$S \times X = s_1 \ldots s_m \times 1(0^m 1)^{m-1} = (0 s_1 \ldots s_m)^m$.

The bitwise & with $E$ gives

$Y = 0 y_{1,1} y_{1,2} \ldots y_{1,m} \, 0 y_{2,1} y_{2,2} \ldots y_{2,m} \, 0 \ldots 0 y_{m,1} y_{m,2} \ldots y_{m,m}$,

where $y_{i,j} = 1$ iff state $j$ is in $S$ and state $i$ is $\epsilon$-reachable from $j$. In other words, the substring $Y_i = y_{i,1} \ldots y_{i,m}$
indicates the set of states in $S$ that have a path of $\epsilon$-transitions to $i$. Hence, state $i$ should be included in
$\mathrm{Close}_A(S)$ precisely if at least one of the bits in $Y_i$ is 1. This is determined next. First $(Y \mid I) - (I \gg m)$ sets
all test bits to 1 and subtracts the test bits shifted right by $m$ positions. This ensures that if all positions in
$Y_i$ are 0, the $i$th test bit in the result is 0 and otherwise 1. The test bits are then extracted with a bitwise &
with $I$, producing the string $Z = z_1 0^m z_2 0^m \ldots z_m 0^m$. This is almost what we want since $z_i = 1$ iff state $i$ is
in $\mathrm{Close}_A(S)$. The final computation compresses $Z$ into the desired format. The multiplication produces
the following length $2m^2$ string:

$Z \times C = z_1 0^m z_2 0^m \ldots z_m 0^m \times 1(0^{m-1} 1)^{m-1}$

$= z_1 0^{m-1} \, z_1 z_2 0^{m-2} \cdots z_1 \ldots z_k 0^{m-k} \cdots z_1 \ldots z_{m-1} 0 \, z_1 \ldots z_m \, 0 z_2 \ldots z_m \cdots 0^k z_{k+1} \ldots z_m \cdots 0^{m-1} z_m \, 0^m$

In particular, positions $m(m - 1) + 1$ through $m^2$ (from the left) contain the test bits compressed into a
string of length $m$. The two shifts zero all other bits and move this substring to the rightmost position in
the word, producing the final result. Since $m = O(\sqrt{w})$ all of the above operations can be done in constant
time.
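The word-level operations above can be tried out directly with Python integers standing in for machine words. The sketch below is an illustration only: the word size `W = 64`, the helper names (`bits`, `build_tables`, `move`, `close`), and the toy automaton (the 4-state TNFA for $a^*$, numbered 1 to 4 in topological order, with $\epsilon$-reachability given by hand) are assumptions of this example. State $i$ is stored at bit $m - i$, so that $s_1$ is the leftmost bit of the length $m$ string.

```python
W = 64  # assumed word size; the layout needs m * (m + 1) <= W


def bits(states, m):
    """Bitstring s_1 ... s_m as an integer, with state i at bit m - i."""
    return sum(1 << (m - i) for i in states)


def build_tables(m, reach):
    """reach[j] = set of states epsilon-reachable from j (including j itself)."""
    assert m * (m + 1) <= W
    E = 0
    for j, targets in reach.items():
        for i in targets:                      # e_{i,j} sits in block i at offset j
            E |= 1 << ((m - i) * (m + 1) + (m - j))
    I = sum(1 << (k * (m + 1) + m) for k in range(m))   # (1 0^m)^m
    X = sum(1 << (k * (m + 1)) for k in range(m))       # 1 (0^m 1)^(m-1)
    C = sum(1 << (k * m) for k in range(m))             # 1 (0^(m-1) 1)^(m-1)
    return E, I, X, C


def move(S, D_a):
    """S' := (S >> 1) & D_a."""
    return (S >> 1) & D_a


def close(S, m, E, I, X, C):
    mask = (1 << W) - 1                        # emulate W-bit unsigned arithmetic
    Y = (S * X) & E                            # m copies of S, filtered by E
    Z = ((Y | I) - (I >> m)) & I               # test bit i survives iff Y_i != 0
    ZC = (Z * C) & mask                        # gathers z_1 ... z_m contiguously
    return ((ZC << (W - m * (m + 1))) & mask) >> (W - m)


# TNFA for a*: 1 = start, 2 -a-> 3, 4 = accepting; epsilon edges 1->2, 1->4, 3->4, 3->2.
M = 4
REACH = {1: {1, 2, 4}, 2: {2}, 3: {2, 3, 4}, 4: {4}}
E, I, X, C = build_tables(M, REACH)
D_A = bits({3}, M)                             # state 3 is the only a-state

S0 = close(bits({1}, M), M, E, I, X, C)        # Close({start})
S1 = close(move(S0, D_A), M, E, I, X, C)       # after reading one 'a'
```

Here `S0` encodes the state-set {1, 2, 4} and `S1` encodes {2, 3, 4}; both contain the accepting state 4, in agreement with the fact that the empty string and `a` both match $a^*$.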

Finally, observe that $\mathrm{Insert}_A$ and $\mathrm{Member}_A$ are trivially implemented in constant time. Thus,

Lemma 39 For any TNFA with $m = O(\sqrt{w})$ states there is a simulation data structure using $O(m)$ space
and $O(m^2)$ preprocessing time which supports all operations in $O(1)$ time.

The main bottleneck in the above data structure is the string $E$ that represents all $\epsilon$-paths. On a TNFA
with $m$ states $E$ requires at least $m^2$ bits and hence this approach only works for $m = O(\sqrt{w})$. In the next
section we show how to use the structure of TNFAs to do better.

6.5 Overcoming the ε-closure Bottleneck

In this section we show how to compute an $\epsilon$-closure on a TNFA with $m = O(w)$ states in $O(\log m)$ time.
Compared with the result of the previous section we quadratically increase the size of the TNFA at the
expense of using logarithmic time. The algorithm is easily extended to an efficient simulation data structure.
The key idea is a new hierarchical decomposition of TNFAs described below.

6.5.1 Partial-TNFAs and Separator Trees 

First we need some definitions. Let $A$ be a TNFA with parse tree $T$. Each node $v$ in $T$ uniquely corresponds
to two states in $A$, namely, the start and accepting states $\theta_{A'}$ and $\phi_{A'}$ of the TNFA $A'$ with the parse tree
consisting of $v$ and all descendants of $v$. We say $v$ associates the states $S(v) = \{\theta_{A'}, \phi_{A'}\}$. In general, if $C$
is a cluster of $T$, i.e., any connected subgraph of $T$, we say $C$ associates the set of states $S(C) = \bigcup_{v \in C} S(v)$.
We define the partial-TNFA (pTNFA) for $C$ as the directed, labeled subgraph of $A$ induced by the set of
states $S(C)$. In particular, $A$ is a pTNFA since it is induced by $S(T)$. The two states associated by the
root node of $C$ are defined to be the start and accepting state of the corresponding pTNFA. We need the
following result.

Lemma 40 For any pTNFA $P$ with $m > 2$ states there exists a partitioning of $P$ into two subgraphs $P_O$
and $P_I$ such that

(i) $P_O$ and $P_I$ are pTNFAs with at most $\frac{2}{3}m + 2$ states each,

(ii) any transition from $P_O$ to $P_I$ ends in $\theta_{P_I}$, and any transition from $P_I$ to $P_O$ starts in $\phi_{P_I}$, and

(iii) the partitioning can be computed in $O(m)$ time.

Proof. Let $P$ be a pTNFA with $m > 2$ states and let $C$ be the corresponding cluster with $t$ nodes. Since
$C$ is a binary tree with more than 1 node, Jordan's classical result [Jor69] establishes that we can find in
$O(t)$ time an edge $e$ in $C$ whose removal splits $C$ into two clusters each with at most $\frac{2}{3}t + 1$ nodes. These
two clusters correspond to two pTNFAs, $P_O$ and $P_I$, and since $m = 2t$ each of these has at most $\frac{2}{3}m + 2$
states. Hence, (i) and (iii) follow. For (ii) assume w.l.o.g. that $P_O$ is the pTNFA containing the start and
accepting state of $P$, i.e., $\theta_{P_O} = \theta_P$ and $\phi_{P_O} = \phi_P$. Then, $P_O$ is the pTNFA obtained from $P$ by removing
all states of $P_I$. From Thompson's construction it is easy to check that any transition from $P_O$ to $P_I$ ends
in $\theta_{P_I}$ and any transition from $P_I$ to $P_O$ must start in $\phi_{P_I}$. □
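The edge used in the proof can be located with a single walk from the root towards the heavier child. The sketch below is one standard way to do this, given under our own assumptions (it is not claimed to be the procedure of [Jor69]): the tree is passed as a dictionary `children` mapping each node to a list of at most two children, it has $t \ge 2$ nodes, and the function name `separator_edge` is hypothetical. The walk stops at the first child whose subtree has at most $\frac{2}{3}t + 1$ nodes; since that child is the heavier one below a node with a large subtree, the remaining part is small as well.

```python
def separator_edge(children, root):
    """Return (v, c, inner, outer): removing edge (v, c) leaves the subtree of c
    with `inner` nodes and the rest with `outer` nodes, both at most 2t/3 + 1.

    children: dict mapping a node to the list of its (at most 2) children.
    Requires a tree with at least 2 nodes.
    """
    # Breadth-first order, so that reversed(order) visits children before parents.
    order = [root]
    for v in order:
        order.extend(children.get(v, []))
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[c] for c in children.get(v, []))

    t = size[root]
    v = root
    while True:
        c = max(children[v], key=size.__getitem__)   # heavier child of v
        if 3 * size[c] <= 2 * t + 3:                 # size[c] <= 2t/3 + 1
            return v, c, size[c], t - size[c]
        v = c


# Parse tree of ac|a*b with nodes numbered in preorder.
PARSE_TREE = {0: [1, 4], 1: [2, 3], 4: [5, 7], 5: [6]}
# A path on 10 nodes, the most unbalanced case.
PATH = {i: [i + 1] for i in range(9)}
```

For `PARSE_TREE` the walk stops immediately and splits the 8 nodes into two parts of 4 nodes each. For `PATH` it descends two steps before the lower part (7 nodes) fits under the bound $\frac{2}{3} \cdot 10 + 1$.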



Intuitively, if we draw $P$, $P_I$ is "surrounded" by $P_O$, and therefore we will often refer to $P_I$ and $P_O$ as the
inner pTNFA and the outer pTNFA, respectively (see Fig. 6.3(a)). Applying Lemma 40 recursively gives the
following essential data structure. Let $P$ be a pTNFA with $m$ states. The separator tree for $P$ is a binary,
rooted tree $B$ defined as follows: If $m = 2$, i.e., $P$ is a trivial pTNFA consisting of two states $\theta_P$ and $\phi_P$,
then $B$ is a single leaf node $v$ that stores the set $X(v) = \{\theta_P, \phi_P\}$. Otherwise ($m > 2$), compute $P_O$ and $P_I$
according to Lemma 40. The root $v$ of $B$ stores the set $X(v) = \{\theta_{P_I}, \phi_{P_I}\}$, and the children of $v$ are roots
of separator trees for $P_O$ and $P_I$, respectively (see Fig. 6.3(b)).

With the above construction each node in the separator tree naturally corresponds to a pTNFA, e.g.,
the root corresponds to $P$, the children to $P_I$ and $P_O$, and so on. We denote the pTNFA corresponding to
node $v$ in $B$ by $P(v)$. A simple induction combined with Lemma 40(i) shows that if $v$ is a node of depth $k$
then $P(v)$ contains at most $(\frac{2}{3})^k m + 6$ states. Hence, the depth of $B$ is at most $d = \log_{3/2} m + O(1)$. By
Lemma 40(iii) each level of $B$ can be computed in $O(m)$ time and thus $B$ can be computed in $O(m \log m)$
total time.

6.5.2 A Recursive ε-Closure Algorithm



We now present a simple $\epsilon$-closure algorithm for a pTNFA, which recursively traverses the separator tree $B$.
We first give the high level idea and then show how it can be implemented in $O(1)$ time for each level of $B$.
Since the depth of $B$ is $O(\log m)$ this leads to the desired result. For a pTNFA $P$ with $m$ states, a separator
tree $B$ for $P$, and a node $v$ in $B$ define

$\mathrm{Close}_{P(v)}(S)$:

1. Compute the set $Z \subseteq X(v)$ of states in $X(v)$ that are $\epsilon$-reachable from $S$ in $P(v)$.

2. If $v$ is a leaf return $S' := Z$, else let $u$ and $w$ be the children of $v$, respectively:

(a) Compute the set $G \subseteq V(P(v))$ of states in $P(v)$ that are $\epsilon$-reachable from $Z$.

(b) Return $S' := \mathrm{Close}_{P(u)}((S \cup G) \cap V(P(u))) \cup \mathrm{Close}_{P(w)}((S \cup G) \cap V(P(w)))$.

Lemma 41 For any node $v$ in the separator tree of a pTNFA $P$, $\mathrm{Close}_{P(v)}(S)$ computes the set of states in
$P(v)$ reachable from $S$ via a path of $\epsilon$-transitions.

Proof. Let $\hat{S}$ be the set of states in $P(v)$ reachable from $S$ via a path of $\epsilon$-transitions. We need to show that $\hat{S} = S'$.
It is easy to check that any state in $S'$ is reachable via a path of $\epsilon$-transitions and hence $S' \subseteq \hat{S}$. We show
the other direction by induction on the separator tree. If $v$ is a leaf then the set of states in $P(v)$ is exactly
$X(v)$. Since $S' = Z$ the claim follows. Otherwise, let $u$ and $w$ be the children of $v$, and assume w.l.o.g. that
$X(v) = \{\theta_{P(u)}, \phi_{P(u)}\}$. Consider a path $p$ of $\epsilon$-transitions from a state $s \in S$ to a state $s'$. There are two cases to
consider:

Case 1: $s' \in V(P(u))$. If $p$ consists entirely of states in $P(u)$ then by induction it follows that $s' \in
\mathrm{Close}_{P(u)}(S \cap V(P(u)))$. Otherwise, $p$ contains a state from $P(w)$. However, by Lemma 40(ii) $\theta_{P(u)}$ is
on $p$ and hence $\theta_{P(u)} \in Z$. It follows that $s' \in G$ and therefore $s' \in \mathrm{Close}_{P(u)}(G \cap V(P(u)))$.

Case 2: $s' \in V(P(w))$. As above, with the exception that $\phi_{P(u)}$ is now the state in $Z$.

In all cases $s' \in S'$ and the result follows. □



6.5.3 Implementing the Algorithm 

Next we show how to efficiently implement the above algorithm in parallel. The key ingredient is a compact
mapping of states into positions in bitstrings. Suppose $B$ is the separator tree of depth $d$ for a pTNFA
$P$ with $m$ states. The separator mapping $M$ maps the states of $P$ into an interval of integers $[1, l]$, where
$l = 3 \cdot 2^d$. The mapping is defined recursively according to the separator tree. Let $v$ be the root of $B$. If $v$ is a
leaf node the interval is $[1, 3]$. The two states of $P$, $\theta_P$ and $\phi_P$, are mapped to positions 2 and 3, respectively,
while position 1 is left intentionally unmapped. Otherwise, let $u$ and $w$ be the children of $v$. Recursively,
map $P(u)$ to the interval $[1, l/2]$ and $P(w)$ to the interval $[l/2 + 1, l]$. Since the separator tree contains at
most $2^d$ leaves and each contributes 3 positions the mapping is well-defined. The size of the interval for $P$ is
$l = 3 \cdot 2^{\log_{3/2} m + O(1)} = O(m)$. We will use the unmapped positions as test bits in our algorithm.

The separator mapping compactly maps all pTNFAs represented in $B$ into small intervals. Specifically, if
$v$ is a node at depth $k$ in $B$, then $P(v)$ is mapped to an interval of size $l/2^k$ of the form $[(i - 1) \cdot \frac{l}{2^k} + 1, \, i \cdot \frac{l}{2^k}]$,
for some $1 \le i \le 2^k$. The intervals that correspond to a pTNFA $P(v)$ are mapped and all other intervals
are unmapped. We will refer to a state $s$ of $P$ by its mapped position $M(s)$. A state-set of $P$ is represented
by a bitstring $S$ such that, for all mapped positions $i$, $S[i] = 1$ iff state $i$ is in the state-set. Since $m = O(w)$,
state-sets are represented in a constant number of words.

To implement the algorithm we define a simple data structure consisting of four length / bitstrings X%, 
X%, E® k , and for each level k of the separator tree. For notational convenience, we will consider the 
strings at level k as two-dimensional arrays consisting of 2 k intervals of length l/2 k , i.e., X®[i,j] is position j 
in the ith interval of X e k . If the ith interval at level k is unmapped then all positions in this interval are in 
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all four strings. Otherwise, suppose that the interval corresponds to a pTNFA P(v) and let X(v) = {9 V , <j) v }. 
The strings are defined as follows: 

1 iff 9 V is e- reachable in P(v) from state j, 
1 iff state j is e- reachable in P(v) from 6 V , 
1 iff is e-reachable in P(v) from state j, 
1 iff state j is e-reachable in P(v) from <f> v . 

In addtion to these, we also store a string I k containing a test bit for each interval, that is, Ik[i,j] = 1 iff 
j = 1. Since the depth of B is O(logm) the strings use O(logm) words. With a simple depth-first search 
they can all be computed in 0(m log m) time. 

Let S be a bitstring representing a state-set of A. We implement the operation Closers') by computing 
a sequence of intermediate strings So, ■ ■ ■ , Sj each corresponding to a level in the above recursive algorithm. 
Initially, So := S and the final string Sd is the result of Closer (<S). At level k, < k < d, we compute Sk+i 
from Sk as follows. Let t = l/2 k — 1. 





— Sk & -Xj. 




= | J fc ) - (/ fc »t))kl k 




= Z e - (Z e » t) 


G e 




Y'f' 


= s k k xt 


Z 4> 


= {{Y+\I k )-(I k »t))kI k 


p<P 


= Z* - (Z* » t) 


G* 


= F 4 'kEt 


fe+i 


= S k \G e \G< t > 



We argue that the computation correctly simulates (in parallel) a level of the recursive algorithm. Assume
that at the beginning of level $k$ the string $S_k$ represents the state-set corresponding to the recursive algorithm
after $k$ levels. We interpret $S_k$ as divided into $r = l/2^k$ intervals of length $t = l/2^k - 1$, each prefixed with a
test bit, i.e.,

$$S_k = 0\,s_{1,1}s_{1,2}\cdots s_{1,t}\;0\,s_{2,1}s_{2,2}\cdots s_{2,t}\;0\;\ldots\;0\,s_{r,1}s_{r,2}\cdots s_{r,t}$$

Assume first that all these intervals are mapped intervals corresponding to pTNFAs $P(v_1), \ldots, P(v_r)$, and
let $X(v_i) = \{\theta_{v_i}, \phi_{v_i}\}$, $1 \le i \le r$. Initially, $S_k \,\&\, X^{\theta}_k$ produces the string

$$Y^{\theta} = 0\,y_{1,1}y_{1,2}\cdots y_{1,t}\;0\,y_{2,1}y_{2,2}\cdots y_{2,t}\;0\;\ldots\;0\,y_{r,1}y_{r,2}\cdots y_{r,t},$$

where $y_{i,j} = 1$ iff $\theta_{v_i}$ is $\epsilon$-reachable in $P(v_i)$ from state $j$ and $j$ is in $S_k$. Then, similar to the second line in
the simple algorithm, $((Y^{\theta} \mid I_k) - (I_k \gg t)) \,\&\, I_k$ produces a string of test bits $Z^{\theta} = z_1 0^t z_2 0^t \ldots z_r 0^t$, where
$z_i = 1$ iff at least one of $y_{i,1}, \ldots, y_{i,t}$ is 1. In other words, $z_i = 1$ iff $\theta_{v_i}$ is $\epsilon$-reachable in $P(v_i)$ from any state
in $S_k \cap V(P(v_i))$. Intuitively, $Z^{\theta}$ corresponds to the "$\theta$-part" of the $Z$-set in the recursive algorithm.
Next we "copy" the test bits to get the string $F^{\theta} = Z^{\theta} - (Z^{\theta} \gg t) = 0 z_1^t\, 0 z_2^t \ldots 0 z_r^t$. The bitwise $\&$ with
$E^{\theta}_k$ gives

$$G^{\theta} = 0\,g_{1,1}g_{1,2}\cdots g_{1,t}\;0\,g_{2,1}g_{2,2}\cdots g_{2,t}\;0\;\ldots\;0\,g_{r,1}g_{r,2}\cdots g_{r,t}.$$

By definition, $g_{i,j} = 1$ iff state $j$ is $\epsilon$-reachable in $P(v_i)$ from $\theta_{v_i}$ and $z_i = 1$. In other words, $G^{\theta}$ represents,
for $1 \le i \le r$, the states in $P(v_i)$ that are $\epsilon$-reachable from $S_k \cap V(P(v_i))$ through $\theta_{v_i}$. Again, notice the
correspondence with the $G$-set in the recursive algorithm. The next 4 lines are identical to the first 4 with the
exception that $\theta$ is exchanged by $\phi$. Hence, $G^{\phi}$ represents the states that are $\epsilon$-reachable through $\phi_{v_1}, \ldots, \phi_{v_r}$.



X e k [i,j] = 
E e k [i,j] = 
Xt[i,j} = 
Et[i,j] = 
Finally, $S_k \mid G^{\theta} \mid G^{\phi}$ computes the union of the states in $S_k$, $G^{\theta}$, and $G^{\phi}$, producing the desired state-set
$S_{k+1}$ for the next level of the recursion. In the above, we assumed that all intervals were mapped. If this
is not the case it is easy to check that the algorithm is still correct since the strings in our data structure
contain 0s in all unmapped intervals. The algorithm uses constant time for each of the $d = O(\log m)$ levels
and hence the total time is $O(\log m)$.

6.5.4 The Simulation Data Structure 

Next we show how to get a full simulation data structure. First, note that in the separator mapping the
endpoints of the $\alpha$-transitions are consecutive (as in Sec. 6.4). It follows that we can use the same algorithm
as in the previous section to compute $\mathrm{Move}_A$ in $O(1)$ time. This requires a dictionary of bitstrings, $D_\alpha$,
using additional $O(m)$ space and $O(m \log m)$ preprocessing time. The $\mathrm{Insert}_A$ and $\mathrm{Member}_A$ operations are
trivially implemented in $O(1)$ time. Putting it all together we have:

Lemma 42 For a TNFA with $m = O(w)$ states there is a simulation data structure using $O(m)$ space and
$O(m \log m)$ preprocessing time which supports all operations in $O(\log m)$ time.

Combining the simulation data structures from Lemmas 39 and 42 with the reduction from Lemma 38 and
taking the best result gives Theorem 16. Note that the simple simulation data structure is the fastest when
$m = O(\sqrt{w})$ and $n$ is sufficiently large compared to $m$.

6.6 Remarks and Open Problems 

The presented algorithms assume a unit-cost multiplication operation. Since this operation is not in $AC^0$
(the class of circuits of polynomial size (in $w$), constant depth, and unbounded fan-in) it is interesting
to reconsider what happens with our results if we remove multiplication from our machine model. The
simulation data structure from Sec. 6.4 uses multiplication to compute $\mathrm{Close}_A$ and also for the constant time
hashing to access $D_\alpha$. On the other hand, the algorithm of Sec. 6.5 only uses multiplication for the hashing.
However, Lemma 42 still holds since we can simply replace the hashing by a binary search tree, which uses
$O(\log m)$ time. It follows that Theorem 16 still holds except for the $O(n + m^2)$ bound in the last line.

Another interesting point is to compare our results with the classical Shift-Or algorithm by Baeza-Yates
and Gonnet [BYG92] for exact pattern matching. Like ours, their algorithm simulates an NFA with $m$ states
using word-level parallelism. The structure of this NFA permits a very efficient simulation with an $O(w)$
speedup of the simple $O(nm)$ time simulation. Our results generalize this to regular expressions with a
slightly worse speedup of $O(w/\log w)$. We wonder if it is possible to remove the $O(\log w)$ factor separating
these bounds.

From a practical viewpoint, the simple algorithm of Sec. 6.4 seems very promising since only about 15 
instructions are needed to carry out a step in the state-set simulation. Combined with ideas from [NR04] we 
believe that this could lead to a practical improvement over previous algorithms. 
Abstract



We study the approximate string matching and regular expression matching problems for the case when
the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We
present a time-space trade-off that leads to algorithms improving the previously known complexities for
both problems. In particular, we significantly improve the space bounds, which in practical applications
are likely to be a bottleneck.

7.1 Introduction 

Modern text databases, e.g. for biological and World Wide Web data, are huge. To save time and space, it is
desirable if data can be kept in compressed form and still allow efficient searching. Motivated by this, Amir
and Benson [AB92a, AB92b] initiated the study of compressed pattern matching problems, that is, given a
text string $Q$ in compressed form $Z$ and a specified (uncompressed) pattern $P$, find all occurrences of $P$ in $Q$
without decompressing $Z$. The goal is to search more efficiently than the naive approach of decompressing $Z$
into $Q$ and then searching for $P$ in $Q$. Various compressed pattern matching algorithms have been proposed
depending on the type of pattern and compression method, see e.g., [AB92b, FT98, KTS+98, KNU03, Nav03,
MUN03]. For instance, given a string $Q$ of length $u$ compressed with the Ziv-Lempel-Welch scheme [Wel84]
into a string of length $n$, Amir et al. [ABF96] gave an algorithm for finding all exact occurrences of a pattern
string of length $m$ in $O(n + m^2)$ time and space.

In this paper we study the classical approximate string matching and regular expression matching
problems in the context of compressed texts. As in previous work on these problems [KNU03, Nav03] we focus
on the popular ZL78 and ZLW adaptive dictionary compression schemes [ZL78, Wel84]. We present a new
technique that gives a general time-space trade-off. The resulting algorithms improve all previously known
complexities for both problems. In particular, we significantly improve the space bounds. When searching
large text databases, space is likely to be a bottleneck and therefore this is of crucial importance.

7.1.1 Approximate String Matching 

Given strings $P$ and $Q$ and an error threshold $k$, the classical approximate string matching problem is to find
all ending positions of substrings of $Q$ whose edit distance to $P$ is at most $k$. The edit distance between two
strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to
the other. The classical dynamic programming solution due to Sellers [Sel80] solves the problem in $O(um)$
time and $O(m)$ space, where $u$ and $m$ are the lengths of $Q$ and $P$, respectively. Several improvements of this
result are known, see e.g., the survey by Navarro [Nav01a]. For this paper we are particularly interested in
the fast solutions for small values of $k$, namely, the $O(uk)$ time algorithm by Landau and Vishkin [LV89] and
the more recent $O(uk^4/m + u)$ time algorithm due to Cole and Hariharan [CH02] (we assume w.l.o.g. that
$k < m$). Both of these can be implemented in $O(m)$ space.

Recently, Kärkkäinen et al. [KNU03] studied this problem for text compressed with the ZL78/ZLW
compression schemes. If $n$ is the length of the compressed text, their algorithm achieves $O(nmk + occ)$ time
and $O(nmk)$ space, where $occ$ is the number of occurrences of the pattern. Currently, this is the only
non-trivial worst-case bound for the general problem on compressed texts. For special cases and restricted
versions, other algorithms have been proposed [MKT+00, NR98]. An experimental study of the problem and
an optimized practical implementation can be found in [NKT+01].

In this paper, we show that the problem is closely connected to the uncompressed problem and we
achieve a simple time-space trade-off. More precisely, let $t(m, u, k)$ and $s(m, u, k)$ denote the time and space,
respectively, needed by any algorithm to solve the (uncompressed) approximate string matching problem
with error threshold $k$ for pattern and text of length $m$ and $u$, respectively. We show the following result.

Theorem 17 Let $Q$ be a string compressed using ZL78 into a string $Z$ of length $n$ and let $P$ be a pattern of
length $m$. Given $Z$, $P$, and a parameter $\tau \ge 1$, we can find all approximate occurrences of $P$ in $Q$ with at
most $k$ errors in $O(n(\tau + m + t(m, 2m + 2k, k)) + occ)$ expected time and $O(n/\tau + m + s(m, 2m + 2k, k) + occ)$
space.

The expectation is due to hashing and can be removed at an additional $O(n)$ space cost. In this case the
bounds also hold for ZLW compressed strings. We assume that the algorithm for the uncompressed problem
produces the matches in sorted order (as is the case for all algorithms that we are aware of). Otherwise,
additional time for sorting must be included in the bounds. To compare Theorem 17 with the result of
Kärkkäinen et al. [KNU03], plug in the Landau-Vishkin algorithm and set $\tau = mk$. This gives an algorithm
using $O(nmk + occ)$ time and $O(n/(mk) + m + occ)$ space. This matches the best known time bound while
improving the space by a factor $\Theta(m^2 k^2)$. Alternatively, if we plug in the Cole-Hariharan algorithm and
set $\tau = k^4 + m$ we get an algorithm using $O(nk^4 + nm + occ)$ time and $O(n/(k^4 + m) + m + occ)$ space.
Whenever $k = O(m^{1/4})$ this is $O(nm + occ)$ time and $O(n/m + m + occ)$ space.

To the best of our knowledge, all previous non-trivial compressed pattern matching algorithms for
ZL78/ZLW compressed text, with the exception of a very slow algorithm for exact string matching by Amir
et al. [ABF96], use $\Omega(n)$ space. This is because the algorithms explicitly construct the dictionary trie of the
compressed texts. Surprisingly, our results show that for the ZL78 compression scheme this is not needed
to get an efficient algorithm. Conversely, if very little space is available our trade-off shows that it is still
possible to solve the problem without decompressing the text.

7.1.2 Regular Expression Matching 

Given a regular expression $R$ and a string $Q$, the regular expression matching problem is to find all ending
positions of substrings in $Q$ that match a string in the language denoted by $R$. The classic textbook solution
to this problem due to Thompson [Tho68] solves the problem in $O(um)$ time and $O(m)$ space, where $u$ and
$m$ are the lengths of $Q$ and $R$, respectively. Improvements based on the Four Russian Technique or word-level
parallelism are given in [Mye92a, BFC05, Bil06].

The only solution to the compressed problem is due to Navarro [Nav03]. His solution depends on word
RAM techniques to encode small sets into memory words, thereby allowing constant time set operations.
On a unit-cost RAM with $w$-bit words this technique can be used to improve an algorithm by at most a
factor $O(w)$. For $w = O(\log u)$ a similar improvement is straightforward to obtain for our algorithm and
we will therefore, for the sake of exposition, ignore this factor in the bounds presented below. With this
simplification Navarro's algorithm uses $O(nm^2 + occ \cdot m \log m)$ time and $O(nm^2)$ space, where $n$ is the length
of the compressed string. In this paper we show the following time-space trade-off:

Theorem 18 Let $Q$ be a string compressed using ZL78 or ZLW into a string $Z$ of length $n$ and let $R$ be a
regular expression of length $m$. Given $Z$, $R$, and a parameter $\tau \ge 1$, we can find all occurrences of substrings
matching $R$ in $Q$ in $O(nm(m + \tau) + occ \cdot m \log m)$ time and $O(nm^2/\tau + nm)$ space.

If we choose $\tau = m$ we obtain an algorithm using $O(nm^2 + occ \cdot m \log m)$ time and $O(nm)$ space. This matches
the best known time bound while improving the space by a factor $\Theta(m)$. With word-parallel techniques these
bounds can be improved slightly. The full details are given in Section 7.4.5.

7.1.3 Techniques 

If pattern matching algorithms for ZL78 or ZLW compressed texts use $\Omega(n)$ working space they can explicitly
store the dictionary trie for the compressed text and apply any linear space data structure to it. This has
proven to be very useful for compressed pattern matching. However, as noted by Amir et al. [ABF96],
$\Omega(n)$ working space may not be feasible for large texts and therefore more space-efficient algorithms are
needed. Our main technical contribution is a simple $o(n)$ data structure for ZL78 compressed texts. The
data structure gives a way to compactly represent a subset of the trie which combined with the compressed
text enables algorithms to quickly access relevant parts of the trie. This provides a general approach to solve
compressed pattern matching problems in $o(n)$ space, which combined with several other techniques leads
to the above results.

7.2 The Ziv-Lempel Compression Schemes 

Let $\Sigma$ be an alphabet containing $\sigma = |\Sigma|$ characters. A string $Q$ is a sequence of characters from $\Sigma$. The
length of $Q$ is $u = |Q|$ and the unique string of length 0 is denoted $\epsilon$. The $i$th character of $Q$ is denoted $Q[i]$
and the substring beginning at position $i$ of length $j - i + 1$ is denoted $Q[i, j]$. The Ziv-Lempel algorithm
from 1978 [ZL78] provides a simple and natural way to represent strings, which we describe below. Define a
ZL78 compressed string (abbreviated compressed string in the remainder of the paper) to be a string of the
form

$$Z = z_1 \cdots z_n = (r_1, \alpha_1)(r_2, \alpha_2) \cdots (r_n, \alpha_n),$$

where $r_i \in \{0, \ldots, i - 1\}$ and $\alpha_i \in \Sigma$. Each pair $z_i = (r_i, \alpha_i)$ is a compression element, and $r_i$ and $\alpha_i$ are
the reference and label of $z_i$, denoted by reference$(z_i)$ and label$(z_i)$, respectively. Each compression element
represents a string, called a phrase. The phrase for $z_i$, denoted phrase$(z_i)$, is given by the following recursion.

$$\mathrm{phrase}(z_i) = \begin{cases} \mathrm{label}(z_i) & \text{if } \mathrm{reference}(z_i) = 0, \\ \mathrm{phrase}(\mathrm{reference}(z_i)) \cdot \mathrm{label}(z_i) & \text{otherwise.} \end{cases}$$

The $\cdot$ denotes concatenation of strings. The compressed string $Z$ represents the concatenation of the phrases,
i.e., the string phrase$(z_1) \cdots$ phrase$(z_n)$.

Let $Q$ be a string of length $u$. In ZL78, the compressed string representing $Q$ is obtained by greedily
parsing $Q$ from left-to-right with the help of a dictionary $D$. For simplicity in the presentation we assume
the existence of an initial compression element $z_0 = (0, \epsilon)$ where phrase$(z_0) = \epsilon$. Initially, let $z_0 = (0, \epsilon)$
and let $D = \{\epsilon\}$. After step $i$ we have computed a compressed string $z_0 z_1 \cdots z_i$ representing $Q[1, j]$ and
$D = \{\mathrm{phrase}(z_0), \ldots, \mathrm{phrase}(z_i)\}$. We then find the longest prefix of $Q[j + 1, u - 1]$ that matches a string in
$D$, say phrase$(z_k)$, and let phrase$(z_{i+1}) = \mathrm{phrase}(z_k) \cdot Q[j + 1 + |\mathrm{phrase}(z_k)|]$. Set $D = D \cup \{\mathrm{phrase}(z_{i+1})\}$
and let $z_{i+1} = (k, Q[j + 1 + |\mathrm{phrase}(z_k)|])$. The compressed string $z_0 z_1 \ldots z_{i+1}$ now represents the string
$Q[1, j + |\mathrm{phrase}(z_{i+1})|]$ and $D = \{\mathrm{phrase}(z_0), \ldots, \mathrm{phrase}(z_{i+1})\}$. We repeat this process until all of $Q$ has
been read.

Since each phrase is the concatenation of a previous phrase and a single character, the dictionary $D$ is
prefix-closed, i.e., any prefix of a phrase is also a phrase. Hence, we can represent it compactly as a trie
where each node $i$ corresponds to a compression element $z_i$ and phrase$(z_i)$ is the concatenation of the labels
on the path from $z_0$ to node $i$. Due to greediness, the phrases are unique and therefore the number of nodes
in $D$ for a compressed string $Z$ of length $n$ is $n + 1$. An example of a string and the corresponding compressed
string is given in Fig. 7.1.

Q = ananas
Z = (0,a)(0,n)(1,n)(1,s)

Figure 7.1: The compressed string $Z$ representing $Q$ and the corresponding dictionary trie $D$. Taken
from [Nav03].
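
The greedy parse and the phrase recursion can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions: the compressed string is held as a Python list of (reference, label) pairs with the initial element at index 0, the dictionary trie is a plain `dict`, and a trailing partial phrase is emitted as a repeated pair. The function names are hypothetical.

```python
def zl78_compress(q):
    """Greedy ZL78 parse of q into a list of (reference, label) pairs.

    Index 0 holds the initial element z_0 = (0, "") with phrase(z_0) = "".
    """
    z = [(0, "")]
    trie = {}   # (parent index, character) -> child index
    node = 0    # current node in the dictionary trie (0 is the root)
    for ch in q:
        child = trie.get((node, ch))
        if child is not None:
            node = child                    # keep extending the current match
        else:
            z.append((node, ch))            # new phrase = phrase(z_node) + ch
            trie[(node, ch)] = len(z) - 1
            node = 0
    if node != 0:
        # Assumption of this sketch: the text ended inside a dictionary
        # phrase, so the last element repeats the pair of that phrase.
        z.append(z[node])
    return z


def phrase(z, i):
    """Decode phrase(z_i) by following references towards the root."""
    chars = []
    while i != 0:
        ref, label = z[i]
        chars.append(label)
        i = ref
    return "".join(reversed(chars))


def zl78_decompress(z):
    """Concatenate phrase(z_1), ..., phrase(z_n)."""
    return "".join(phrase(z, i) for i in range(1, len(z)))


if __name__ == "__main__":
    z = zl78_compress("ananas")
    print(z[1:])
    print(zl78_decompress(z))
```

For the string of Fig. 7.1 the sketch produces the four pairs shown there; decoding a phrase costs constant time per character because each step reads one pair of the list.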

Throughout the paper we will identify compression elements with nodes in the trie $D$, and therefore we
use standard tree terminology, briefly summed up here: The distance between two elements is the number
of edges on the unique simple path between them. The depth of element $z$ is the distance from $z$ to $z_0$ (the
root of the trie). An element $x$ is an ancestor of an element $z$ if phrase$(x)$ is a prefix of phrase$(z)$. If also
$|\mathrm{phrase}(x)| = |\mathrm{phrase}(z)| - 1$ then $x$ is the parent of $z$. If $x$ is an ancestor of $z$ then $z$ is a descendant of $x$ and
if $x$ is the parent of $z$ then $z$ is the child of $x$. The length of a path $p$ is the number of edges on the path, and
is denoted $|p|$. The label of a path is the concatenation of the labels on these edges.

Note that for a compression element $z$, reference$(z)$ is a pointer to the parent of $z$ and label$(z)$ is the
label of the edge to the parent of $z$. Thus, given $z$ we can use the compressed text $Z$ directly to decode the
label of the path from $z$ towards the root in constant time per element. We will use this important property
in many of our results.

If the dictionary $D$ is implemented as a trie it is straightforward to compress $Q$ or decompress $Z$ in
$O(u)$ time. Furthermore, if we do not want to explicitly decompress $Z$ we can compute the trie in $O(n)$
time, and as mentioned above, this is done in almost all previous compressed pattern matching algorithms on
Ziv-Lempel compression schemes. However, this requires at least $\Omega(n)$ space which is insufficient to achieve
our bounds. In the next section we show how to partially represent the trie in less space.

7.2.1 Selecting Compression Elements 

Let $Z = z_0 \ldots z_n$ be a compressed string. For our results we need an algorithm to select a compact subset of
the compression elements such that the distance from any element to an element in the subset is no larger
than a given threshold. More precisely, we show the following lemma.

Lemma 43 Let $Z$ be a compressed string of length $n$ and let $1 \le \tau \le n$ be a parameter. There is a set
of compression elements $C$ of $Z$, computable in $O(n\tau)$ expected time and $O(n/\tau)$ space with the following
properties:

(i) $|C| = O(n/\tau)$.

(ii) For any compression element $z_i$ in $Z$, the minimum distance to any compression element in $C$ is at
most $2\tau$.

Proof. Let $1 \le \tau \le n$ be a given parameter. We build $C$ incrementally in a left-to-right scan of $Z$. The set is
maintained as a dynamic dictionary using dynamic perfect hashing [DKM+94], i.e., constant time worst-case
access and constant time amortized expected update. Initially, we set $C = \{z_0\}$. Suppose that we have
read $z_0, \ldots, z_i$. To process $z_{i+1}$ we follow the path $p$ of references until we encounter an element $y$ such that
$y \in C$. We call $y$ the nearest special element of $z_{i+1}$. Let $l$ be the number of elements in $p$ including $z_{i+1}$ and
$y$. Since each lookup in $C$ takes constant time the time to find the nearest special element is $O(l)$. If $l < 2\tau$
we are done. Otherwise, if $l = 2\tau$, we find the $\tau$th element $y'$ in the reference path and set $C := C \cup \{y'\}$.
As the trie grows under addition of leaves condition (ii) follows. Moreover, any element chosen to be in $C$
has at least $\tau$ descendants of distance at most $\tau$ that are not in $C$ and therefore condition (i) follows. The
time for each step is $O(\tau)$ amortized expected and therefore the total time is $O(n\tau)$ expected. The space is
proportional to the size of $C$ hence the result follows. □
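
The left-to-right scan in the proof can be sketched as follows. This is an illustrative sketch, not the implementation analysed above: it assumes the compressed string is a Python list of (reference, label) pairs with the initial element at index 0, and it uses a built-in `set` in place of dynamic perfect hashing. The names `select_special` and `distance_to_special` are hypothetical.

```python
def select_special(z, tau):
    """Return the indices of the special elements C for a parameter tau >= 1.

    z[i] = (reference, label); z[0] is the initial element (the trie root).
    """
    special = {0}
    for i in range(1, len(z)):
        # Follow references until the nearest special element is reached.
        path = [i]
        while path[-1] not in special:
            path.append(z[path[-1]][0])
        # The path never holds more than 2*tau elements; when it holds
        # exactly 2*tau, its tau-th element becomes special.
        if len(path) == 2 * tau:
            special.add(path[tau - 1])
    return special


def distance_to_special(z, special, i):
    """Number of reference steps from z_i to its nearest special ancestor."""
    d = 0
    while i not in special:
        i = z[i][0]
        d += 1
    return d


if __name__ == "__main__":
    # A worst-case chain: phrase(z_i) = "a" repeated i times.
    chain = [(0, "")] + [(i - 1, "a") for i in range(1, 10)]
    c = select_special(chain, 3)
    print(sorted(c))
    print(max(distance_to_special(chain, c, i) for i in range(len(chain))))
```

On the chain, every third element becomes special, which matches the intuition behind properties (i) and (ii).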



7.2.2 Other Ziv-Lempel Compression Schemes 

A popular variant of ZL78 is the ZLW compression scheme [Wel84]. Here, the labels of compression elements
are not explicitly encoded, but are defined to be the first character of the next phrase. Hence, ZLW does not
offer an asymptotically better compression ratio over ZL78 but gives a better practical performance. The
ZLW scheme is implemented in the UNIX program compress. From an algorithmic viewpoint ZLW is more
difficult to handle in a space-efficient manner since labels are not explicitly stored with the compression
elements as in ZL78. However, if $\Omega(n)$ space is available then we can simply construct the dictionary trie.
This gives constant time access to the label of a compression element and therefore ZL78 and ZLW become
"equivalent". This is the reason why Theorem 17 holds only for ZL78 when space is $o(n)$ but for both when
the space is $\Omega(n)$.

Another well-known variant is the ZL77 compression scheme [ZL77]. Unlike ZL78 and ZLW phrases in 
the ZL77 scheme can be any substring of text that has already been processed. This makes searching much 
more difficult and none of the known techniques for ZL78 and ZLW seems to be applicable. The only known 
algorithm for pattern matching on ZL77 compressed text is due to Farach and Thorup [FT98] who gave an 
algorithm for the exact string matching problem. 

7.3 Approximate String Matching 

In this section we consider the compressed approximate string matching problem. Before presenting our 
algorithm we need a few definitions and properties of approximate string matching. 

Let $A$ and $B$ be strings. Define the edit distance between $A$ and $B$, $\gamma(A, B)$, to be the minimum number
of insertions, deletions, and substitutions needed to transform $A$ to $B$. We say that $j \in [1, |S|]$ is a match
with error at most $k$ of $A$ in a string $S$ if there is an $i \in [1, j]$ such that $\gamma(A, S[i, j]) \le k$. Whenever $k$ is
clear from the context we simply call $j$ a match. All positions $i$ satisfying the above property are called a
start of the match $j$. The set of all matches of $A$ in $S$ is denoted $\Gamma(A, S)$. We need the following well-known
property of approximate matches.

Proposition 5 Any match $j$ of $A$ in $S$ with at most $k$ errors must start in the interval
$[\max(1, j - |A| + 1 - k), \min(|S|, j - |A| + 1 + k)]$.

Proof. Let $l$ be the length of a substring $B$ matching $A$ and ending at $j$. If the match starts outside
the interval then either $l < |A| - k$ or $l > |A| + k$. In these cases, more than $k$ deletions or $k$ insertions,
respectively, are needed to transform $A$ to $B$. □
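
For concreteness, the set of matches of a pattern in an uncompressed string can be computed with the classical column-by-column dynamic program attributed to Sellers in the introduction. The sketch below is one standard way to write it and is given only as an illustration; the function name `approx_matches` is hypothetical, and positions are 1-based as in the text.

```python
def approx_matches(p, s, k):
    """Sorted 1-based end positions j in s such that some substring s[i..j]
    is within edit distance k of the pattern p."""
    m = len(p)
    col = list(range(m + 1))   # column for the empty text: col[i] = i
    matches = []
    for j, c in enumerate(s, start=1):
        prev_diag = col[0]     # D[0][j-1]
        col[0] = 0             # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(prev_diag + (p[i - 1] != c),  # substitute or match
                      col[i] + 1,                   # extra text character
                      col[i - 1] + 1)               # unmatched pattern char
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            matches.append(j)
    return matches


if __name__ == "__main__":
    print(approx_matches("ana", "banana", 0))
    print(approx_matches("ana", "banana", 1))
```

The routine uses $O(m)$ space and time proportional to the product of the two lengths, and it reports the matches in increasing order, which is the form assumed for the match sets later in this section.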



7.3.1 Searching for Matches 

Let $P$ be a string of length $m$ and let $k$ be an error threshold. To avoid trivial cases we assume that $k < m$.
Given a compressed string $Z = z_0 z_1 \ldots z_n$ representing a string $Q$ of length $u$ we show how to find $\Gamma(P, Q)$
efficiently.

Let $l_i = |\mathrm{phrase}(z_i)|$, let $u_0 = 1$, and let $u_i = u_{i-1} + l_{i-1}$, for $1 \le i \le n$, i.e., $l_i$ is the length of the $i$th
phrase and $u_i$ is the starting position in $Q$ of the $i$th phrase. We process $Z$ from left-to-right and at the $i$th
step we find all matches in $[u_i, u_i + l_i - 1]$. Matches in this interval can be either internal or overlapping (or
both). A match $j$ in $[u_i, u_i + l_i - 1]$ is internal if it has a starting point in $[u_i, u_i + l_i - 1]$ and overlapping if
it has a starting point in $[1, u_i - 1]$. To find all matches we will compute the following information for $z_i$.
[Figure labels: rpre($z_{i-1}$), rpre($z_i$), phrase($z_{i-1}$), phrase($z_i$), rsuf($z_{i-1}$), rsuf($z_i$)]

Figure 7.2: The relevant prefix and the relevant suffix of two phrases in $Q$. Here, $|\mathrm{phrase}(z_{i-1})| < m + k$
and therefore rsuf$(z_{i-1})$ overlaps with previous phrases.

• The start position, $u_i$, and length, $l_i$, of phrase$(z_i)$.

• The relevant prefix, rpre$(z_i)$, and the relevant suffix, rsuf$(z_i)$, where

$$\mathrm{rpre}(z_i) = Q[u_i, \min(u_i + m + k - 1, u_i + l_i - 1)],$$
$$\mathrm{rsuf}(z_i) = Q[\max(1, u_i + l_i - m - k), u_i + l_i - 1].$$

In other words, rpre$(z_i)$ is the largest prefix of length at most $m + k$ of phrase$(z_i)$ and rsuf$(z_i)$ is the
substring of length $m + k$ ending at $u_i + l_i - 1$. For an example see Fig. 7.2.

• The match sets $M_I(z_i)$ and $M_O(z_i)$, where

$$M_I(z_i) = \Gamma(P, \mathrm{phrase}(z_i)),$$
$$M_O(z_i) = \Gamma(P, \mathrm{rsuf}(z_{i-1}) \cdot \mathrm{rpre}(z_i)).$$

We assume that both sets are represented as sorted lists in increasing order.

We call the above information the description of $z_i$. In the next section we show how to efficiently
compute descriptions. For now, assume that we are given the description of $z_i$. Then, the set of matches in
$[u_i, u_i + l_i - 1]$ is reported as the set

$$M(z_i) = \{j + u_i - 1 \mid j \in M_I(z_i)\} \cup \big(\{j + u_i - 1 - |\mathrm{rsuf}(z_{i-1})| \mid j \in M_O(z_i)\} \cap [u_i, u_i + l_i - 1]\big).$$

We argue that this is the correct set. Since phrase$(z_i) = Q[u_i, u_i + l_i - 1]$ we have that

$$j \in M_I(z_i) \Leftrightarrow j + u_i - 1 \in \Gamma(P, Q[u_i, u_i + l_i - 1]).$$

Hence, the set $\{j + u_i - 1 \mid j \in M_I(z_i)\}$ is the set of all internal matches. Similarly, $\mathrm{rsuf}(z_{i-1}) \cdot \mathrm{rpre}(z_i) =
Q[u_i - |\mathrm{rsuf}(z_{i-1})|, u_i + |\mathrm{rpre}(z_i)| - 1]$ and therefore

$$j \in M_O(z_i) \Leftrightarrow j + u_i - 1 - |\mathrm{rsuf}(z_{i-1})| \in \Gamma(P, Q[u_i - |\mathrm{rsuf}(z_{i-1})|, u_i + |\mathrm{rpre}(z_i)| - 1]).$$

By Proposition 5 any overlapping match must start at a position within the interval $[\max(1, u_i - m + 1 - k), u_i]$.
Hence, $\{j + u_i - 1 - |\mathrm{rsuf}(z_{i-1})| \mid j \in M_O(z_i)\}$ includes all overlapping matches in $[u_i, u_i + l_i - 1]$. Taking
the intersection with $[u_i, u_i + l_i - 1]$ and the union with the internal matches it follows that the set $M(z_i)$
is precisely the set of matches in $[u_i, u_i + l_i - 1]$. For an example see Fig. 7.3.

Next we consider the complexity of computing the matches. To do this we first bound the size of the
$M_I$ and $M_O$ sets. Since the length of any relevant suffix and relevant prefix is at most $m + k$, we have that
$|M_O(z_i)| \le 2(m + k) < 4m$, and therefore the total size of the $M_O$ sets is at most $O(nm)$. Each element in
the sets $M_I(z_0), \ldots, M_I(z_n)$ corresponds to a unique match. Thus, the total size of the $M_I$ sets is at most
$occ$, where $occ$ is the total number of matches. Since both sets are represented as sorted lists the total time
to compute the matches for all compression elements is $O(nm + occ)$.
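
The way internal and overlapping matches are combined can be checked on small inputs with the following sketch. It is deliberately not space-efficient: it assumes the phrases are already available as plain strings and recomputes both match sets with a quadratic dynamic program, so it only illustrates the combination rule, not the data structure of the next section. All function names are hypothetical.

```python
def approx_matches(p, s, k):
    """Sorted 1-based end positions in s of substrings within distance k of p."""
    m = len(p)
    col = list(range(m + 1))
    matches = []
    for j, c in enumerate(s, start=1):
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cur = min(prev_diag + (p[i - 1] != c), col[i] + 1, col[i - 1] + 1)
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            matches.append(j)
    return matches


def compressed_matches(phrases, p, k):
    """Report all matches of p in the concatenation of phrases, one phrase
    at a time, using only the phrase and a window of m + k characters on
    each side of the phrase boundary. Assumes len(p) >= 1."""
    window = len(p) + k
    result = []
    u = 1               # start position u_i of the current phrase
    rsuf_prev = ""      # relevant suffix of the text read so far
    for ph in phrases:
        l = len(ph)
        rpre = ph[:window]
        m_internal = approx_matches(p, ph, k)
        m_overlap = approx_matches(p, rsuf_prev + rpre, k)
        found = {j + u - 1 for j in m_internal}
        for j in m_overlap:
            pos = j + u - 1 - len(rsuf_prev)
            if u <= pos <= u + l - 1:      # intersect with [u_i, u_i + l_i - 1]
                found.add(pos)
        result.extend(sorted(found))
        rsuf_prev = (rsuf_prev + ph)[-window:]
        u += l
    return result


if __name__ == "__main__":
    phrases = ["a", "n", "an", "as"]       # the phrases of "ananas"
    print(compressed_matches(phrases, "ana", 1))
    print(approx_matches("ana", "".join(phrases), 1))
```

Both calls in the usage example report the same positions, as the correctness argument above predicts.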
Figure 7.3: Example of descriptions. Z is the compressed string representing Q. We are looking for all
matches of the pattern P with error threshold k = 2 in Z. The set of matches is {6, 7, 8, 9, 10, 12}.



7.3.2 Computing Descriptions 

Next we show how to efficiently compute the descriptions. Let $1 \le \tau \le n$ be a parameter. Initially, we
compute a subset $C$ of the elements in $Z$ according to Lemma 43 with parameter $\tau$. For each element $z_j \in C$
we store $l_j$, that is, the length of phrase$(z_j)$. If $l_j > m + k$ we also store the index of the ancestor $x$ of $z_j$
of depth $m + k$. This information can easily be computed while constructing $C$ within the same time and
space bounds, i.e., using $O(n\tau)$ time and $O(n/\tau)$ space.

Descriptions are computed from left-to-right as follows. Initially, set $l_0 = 0$, $u_0 = 1$, rpre$(z_0) = \epsilon$,
rsuf$(z_0) = \epsilon$, $M_I(z_0) = \emptyset$, and $M_O(z_0) = \emptyset$. To compute the description of $z_i$, $1 \le i \le n$, first follow the path
$p$ of references until we encounter an element $z_j \in C$. Using the information stored at $z_j$ we set $l_i := |p| + l_j$
and $u_i := u_{i-1} + l_{i-1}$. By Lemma 43(ii) the distance to $z_j$ is at most $2\tau$ and therefore $l_i$ and $u_i$ can be
computed in $O(\tau)$ time given the description of $z_{i-1}$.

To compute rpre$(z_i)$ we compute the label of the path from $z_0$ towards $z_i$ of length $\min(m + k, l_i)$. There
are two cases to consider: If $l_i \le m + k$ we simply compute the label of the path from $z_i$ to $z_0$ and let rpre$(z_i)$
be the reverse of this string. Otherwise ($l_i > m + k$), we use the "shortcut" stored at $z_j$ to find the ancestor
$z_h$ of distance $m + k$ to $z_0$. The reverse of the label of the path from $z_h$ to $z_0$ is then rpre$(z_i)$. Hence, rpre$(z_i)$
is computed in $O(m + k + \tau) = O(m + \tau)$ time.

The string rsuf$(z_i)$ may be divided over several phrases and we therefore recursively follow paths
towards the root until we have computed the entire string. It is easy to see that the following algorithm
correctly decodes the desired substring of length $\min(m + k, u_i + l_i - 1)$ ending at position $u_i + l_i - 1$.

1. Initially, set $l := \min(m + k, u_i + l_i - 1)$, $t := i$, and $s := \epsilon$.

2. Compute the path $p$ of references from $z_t$ of length $r = \min(l, \mathrm{depth}(z_t))$ and set $s := s \cdot \mathrm{label}(p)$.

3. If $r < l$ set $l := l - r$, $t := t - 1$, and repeat step 2.

4. Return rsuf$(z_i)$ as the reverse of $s$.

Since the length of rsuf$(z_i)$ is at most $m + k$, the algorithm finds it in $O(m + k) = O(m)$ time.
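
Steps 1 to 4 can be sketched directly on a list of (reference, label) pairs. The sketch below makes simplifying assumptions: the initial element sits at index 0, and depths and positions are recomputed by walking references instead of being taken from stored descriptions, so it does not reflect the running time stated above. The helper names are hypothetical.

```python
def depth(z, t):
    """Number of edges from z_t to the root z_0, i.e. the phrase length."""
    d = 0
    while t != 0:
        t = z[t][0]
        d += 1
    return d


def path_label(z, t, length):
    """Label of the path with `length` edges from z_t towards the root."""
    chars = []
    for _ in range(length):
        ref, label = z[t]
        chars.append(label)
        t = ref
    return "".join(chars)


def relevant_suffix(z, i, m, k):
    """Decode the relevant suffix of z_i following steps 1 to 4."""
    # Step 1: the text covered by phrases 1..i has u_i + l_i - 1 characters.
    end = sum(depth(z, t) for t in range(1, i + 1))
    remaining = min(m + k, end)
    t = i
    s = ""
    while remaining > 0:
        r = min(remaining, depth(z, t))    # step 2: walk r edges upwards
        s += path_label(z, t, r)
        remaining -= r                     # step 3: continue in phrase t - 1
        t -= 1
    return s[::-1]                         # step 4: reverse the labels


if __name__ == "__main__":
    z = [(0, ""), (0, "a"), (0, "n"), (1, "n"), (1, "s")]   # "ananas"
    print(relevant_suffix(z, 4, 2, 1))
    print(relevant_suffix(z, 2, 2, 1))
```

In the first call the suffix spans two phrases: the labels "s", "a" are read from the last phrase and "n" from the one before it, and reversing gives the last three characters of the text.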

The match sets $M_I$ and $M_O$ are computed as follows. Let $t(m, u, k)$ and $s(m, u, k)$ denote the time and
space to compute $\Gamma(A, B)$ with error threshold $k$ for strings $A$ and $B$ of lengths $m$ and $u$, respectively. Since
$|\mathrm{rsuf}(z_{i-1}) \cdot \mathrm{rpre}(z_i)| \le 2m + 2k$ it follows that $M_O(z_i)$ can be computed in $t(m, 2m + 2k, k)$ time and
$s(m, 2m + 2k, k)$ space. Since $M_I(z_i) = \Gamma(P, \mathrm{phrase}(z_i))$ we have that $j \in M_I(z_i)$ if and only if $j \in M_I(\mathrm{reference}(z_i))$ or
$j = l_i$. By Proposition 5 any match ending in $l_i$ must start within $[\max(1, l_i - m + 1 - k), \min(l_i, l_i - m + 1 + k)]$.
Hence, there is a match ending in $l_i$ if and only if $l_i \in \Gamma(P, \mathrm{rsuf}'(z_i))$ where $\mathrm{rsuf}'(z_i)$ is the suffix of phrase$(z_i)$
of length $\min(m + k, l_i)$. Note that $\mathrm{rsuf}'(z_i)$ is a suffix of rsuf$(z_i)$ and we can therefore trivially compute it
in $O(m + k)$ time. Thus,

$$M_I(z_i) = M_I(\mathrm{reference}(z_i)) \cup \{l_i \mid l_i \in \Gamma(P, \mathrm{rsuf}'(z_i))\}.$$

Computing $\Gamma(P, \mathrm{rsuf}'(z_i))$ uses $t(m, m + k, k)$ time and $s(m, m + k, k)$ space. Subsequently, constructing
$M_I(z_i)$ takes $O(|M_I(z_i)|)$ time and space. Recall that the elements in the $M_I$ sets correspond uniquely
to matches in $Q$ and therefore the total size of the sets is $occ$. Therefore, using dynamic perfect
hashing [DKM+94] on pointers to non-empty $M_I$ sets we can store these using $O(occ)$ space in total.

7.3.3 Analysis 

Finally, we can put the pieces together to obtain the final algorithm. The preprocessing uses $O(n\tau)$ expected
time and $O(n/\tau)$ space. The total time to compute all descriptions and report occurrences is expected
$O(n(\tau + m + t(m, 2m + 2k, k)) + occ)$. The description for $z_i$, except for $M_I(z_i)$, depends solely on the
description of $z_{i-1}$. Hence, we can discard the description of $z_{i-1}$, except for $M_I(z_{i-1})$, after processing
$z_i$ and reuse the space. It follows that the total space used is $O(n/\tau + m + s(m, 2m + 2k, k) + occ)$. This
completes the proof of Theorem 17. Note that if we use $\Omega(n)$ space we can explicitly construct the dictionary.
In this case hashing is not needed and the bounds also hold for the ZLW compression scheme.

7.4 Regular Expression Matching 

7.4.1 Regular Expressions and Finite Automata 

First we briefly review the classical concepts used in the paper. For more details see, e.g., Aho et al. [ASU86].
The set of regular expressions over $\Sigma$ is defined recursively as follows: A character $\alpha \in \Sigma$ is a regular
expression, and if $S$ and $T$ are regular expressions then so is the concatenation, $(S) \cdot (T)$, the union, $(S)|(T)$,
and the star, $(S)^*$. The language $L(R)$ generated by $R$ is defined as follows: $L(\alpha) = \{\alpha\}$, $L(S \cdot T) =
L(S) \cdot L(T)$, that is, any string formed by the concatenation of a string in $L(S)$ with a string in $L(T)$,
$L(S|T) = L(S) \cup L(T)$, and $L(S^*) = \bigcup_{i \ge 0} L(S)^i$, where $L(S)^0 = \{\epsilon\}$ and $L(S)^i = L(S)^{i-1} \cdot L(S)$, for
$i > 0$.

A finite automaton is a tuple $A = (V, E, \Sigma, \theta, \Phi)$, where $V$ is a set of nodes called states, $E$ is a set of
directed edges between states called transitions each labeled by a character from $\Sigma \cup \{\epsilon\}$, $\theta \in V$ is a start
state, and $\Phi \subseteq V$ is a set of final states. In short, $A$ is an edge-labeled directed graph with a special start
node and a set of accepting nodes. $A$ is a deterministic finite automaton (DFA) if $A$ does not contain any
$\epsilon$-transitions, and all outgoing transitions of any state have different labels. Otherwise, $A$ is a non-deterministic
automaton (NFA).

The label of a path $p$ in $A$ is the concatenation of labels on the transitions in $p$. For a subset $S$ of states
in $A$ and character $\alpha \in \Sigma \cup \{\epsilon\}$, define the transition map, $\delta(S, \alpha)$, as the set of states reachable from $S$
via a path labeled $\alpha$. Computing the set $\delta(S, \alpha)$ is called a state-set transition. We extend $\delta$ to strings by
defining $\delta(S, \alpha \cdot B) = \delta(\delta(S, \alpha), B)$, for any string $B$ and character $\alpha \in \Sigma$. We say that $A$ accepts the string
$B$ if $\delta(\{\theta\}, B) \cap \Phi \ne \emptyset$. Otherwise $A$ rejects $B$. As in the previous section, we say that $j \in [1, |B|]$ is a match
iff there is an $i \in [1, j]$ such that $A$ accepts $B[i, j]$. The set of all matches is denoted $\Delta(A, B)$.

Given a regular expression $R$, an NFA $A$ accepting precisely the strings in $L(R)$ can be obtained by
several classic methods [MY60, Glu61, Tho68]. In particular, Thompson [Tho68] gave a simple well-known
construction which we will refer to as a Thompson NFA (TNFA). A TNFA $A$ for $R$ has at most $2m$ states,
at most $4m$ transitions, and can be computed in $O(m)$ time. Hence, a state-set transition can be computed
in $O(m)$ time using a breadth-first search of $A$ and therefore we can test acceptance of $Q$ in $O(um)$ time
and $O(m)$ space. This solution is easily adapted to find all matches in the same complexity by adding the
start state to each of the computed state-sets immediately before computing the next. Formally, $\delta(S, \alpha \cdot
B) = \delta(\delta(S \cup \{\theta\}, \alpha), B)$, for any string $B$ and character $\alpha \in \Sigma$. A match then occurs at position $j$ if

$$\delta(\{\theta\}, Q[1, j]) \cap \Phi \ne \emptyset.$$
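
The state-set simulation with the start state re-inserted before every step can be sketched as below. The automaton encoding is an assumption made for the illustration: a dictionary maps each state to a list of (label, target) pairs, with the empty string as the label of an $\epsilon$-transition, and the small automaton in the usage example (for the expression $a \cdot b^*$) is built by hand rather than by Thompson's construction. The function names are hypothetical.

```python
def eps_closure(nfa, states):
    """All states reachable from `states` via epsilon-transitions only."""
    stack = list(states)
    seen = set(states)
    while stack:
        s = stack.pop()
        for label, target in nfa.get(s, []):
            if label == "" and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def transition(nfa, states, ch):
    """State-set transition: states reachable from `states` on a path
    whose label is the single character ch."""
    before = eps_closure(nfa, states)
    moved = {t for s in before for label, t in nfa.get(s, []) if label == ch}
    return eps_closure(nfa, moved)


def find_matches(nfa, start, final, text):
    """1-based end positions j such that some substring text[i..j] is accepted."""
    states = set()
    matches = []
    for j, ch in enumerate(text, start=1):
        # Add the start state before each step so a match may begin anywhere.
        states = transition(nfa, states | {start}, ch)
        if states & final:
            matches.append(j)
    return matches


if __name__ == "__main__":
    # Hand-built automaton for a . b*: 0 -a-> 1 -eps-> 2 -b-> 3 -eps-> 2, 2 -eps-> 4.
    nfa = {0: [("a", 1)], 1: [("", 2)], 2: [("b", 3), ("", 4)], 3: [("", 2)]}
    print(find_matches(nfa, 0, {4}, "cabb"))
```

Each call to `transition` touches every state and transition at most a constant number of times, which corresponds to the $O(m)$ bound per character for a TNFA.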
Figure 7.4: The compressed string $Z$ representing $Q$ and the corresponding dictionary trie $D$. The TNFA $A$
for the regular expression $R$ and the corresponding state-sets $S_{u_i}$ are given. The lastmatch pointers are as
follows: lastmatch$(s_7, z_i) = \{z_0\}$ for $i = 0, 1, \ldots, 8$, lastmatch$(s_2, z_i)$ = lastmatch$(s_4, z_i)$ = lastmatch$(s_5, z_i)$ =
$\{z_3\}$ for $i = 3, 6$, and lastmatch$(s_6, z_i) = \{z_2\}$ for $i = 2, 7$. All other lastmatch pointers are $\bot$. Using
the description we can find the matches: Since $s_2 \in S_{u_5}$ the element $z_3 \in M(s_2, z_6)$ represents the match
$u_6 + \mathrm{depth}(z_3) - 1 = 9$. The other matches can be found similarly.



7.4.2 Searching for Matches 

Let $A = (V, E, \Phi)$ be a TNFA with m states. Given a compressed string $Z = z_1 \ldots z_n$ representing
a string Q of length u we show how to find $\Delta(A, Q)$ efficiently. As in the previous section let $l_i$ and $u_i$,
$0 \le i \le n$, be the length and start position of $\mathrm{phrase}(z_i)$. We process Z from left to right and compute a
description for $z_i$ consisting of the following information.

• The integers $l_i$ and $u_i$.

• The state-set $S_{u_i} = \bar{\delta}(\{\theta\}, Q[1, u_i + l_i - 1])$.

• For each state s of A the compression element $\mathrm{lastmatch}(s, z_i) = x$, where x is the ancestor of $z_i$ of
maximum depth such that $\bar{\delta}(\{s\}, \mathrm{phrase}(x)) \cap \Phi \neq \emptyset$. If there is no ancestor that satisfies this, then
$\mathrm{lastmatch}(s, z_i) = \bot$.

An example description is shown in Fig. 7.4. The total size of the description for $z_i$ is $O(m)$ and therefore
the space for all descriptions is $O(nm)$. In the next section we will show how to compute the descriptions.
Assume for now that we have processed $z_0, \ldots, z_{i-1}$. We show how to find the matches within $[u_i, u_i + l_i - 1]$.
Given a state s define $M(s, z_i) = \{x_1, \ldots, x_k\}$, where $x_1 = \mathrm{lastmatch}(s, z_i)$, $x_j = \mathrm{lastmatch}(s, \mathrm{parent}(x_{j-1}))$,
$1 < j \le k$, and either $x_k$ is the root or $\mathrm{lastmatch}(s, \mathrm{parent}(x_k)) = \bot$, i.e., $x_1, \ldots, x_k$ is the sequence of ancestors
of $z_i$ obtained by recursively following lastmatch pointers. By the definition of lastmatch and $M(s, z_i)$ it follows
that $M(s, z_i)$ is the set of ancestors x of $z_i$ such that $\bar{\delta}(\{s\}, \mathrm{phrase}(x)) \cap \Phi \neq \emptyset$. Hence, if $s \in S_{u_{i-1}}$ then each
element $x \in M(s, z_i)$ represents a match, namely, $u_i + \mathrm{depth}(x) - 1$. Each match may occur for each of the
$|S_{u_{i-1}}|$ states and to avoid reporting duplicate matches we use a priority queue to merge the sets $M(s, z_i)$
for all $s \in S_{u_{i-1}}$, while generating these sets in parallel. A similar approach is used in [Nav03]. This takes
$O(\log m)$ time per match. Since each match can be duplicated at most $|S_{u_{i-1}}| = O(m)$ times the total time
for reporting matches is $O(occ \cdot m \log m)$.
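
The merging step can be sketched in Python with a binary heap. This is a minimal illustration under stated assumptions: the dictionaries `parent`, `depth`, and `lastmatch`, the integer identifiers for states and compression elements, and the function name `report_matches` are all hypothetical, and the lastmatch table in the example is made-up data rather than the one from Fig. 7.4. A missing entry plays the role of $\bot$.

```python
import heapq

def report_matches(active_states, z, u_i, parent, depth, lastmatch):
    """Merge the chains M(s, z) for all s in active_states.

    Returns the distinct match positions u_i + depth(x) - 1 in decreasing order.
    """
    heap = []
    for s in active_states:
        x = lastmatch.get((s, z))
        if x is not None:
            heapq.heappush(heap, (-depth[x], s, x))
    positions = []
    while heap:
        neg_depth, s, x = heapq.heappop(heap)
        pos = u_i - neg_depth - 1  # equals u_i + depth(x) - 1
        if not positions or positions[-1] != pos:
            positions.append(pos)  # equal positions pop consecutively, so this skips duplicates
        p = parent.get(x)
        if p is not None:
            nxt = lastmatch.get((s, p))
            if nxt is not None:
                heapq.heappush(heap, (-depth[nxt], s, nxt))
    return positions

# Hypothetical trie path 0 -> 1 -> 2 -> 3 (element 0 is the root).
PARENT = {1: 0, 2: 1, 3: 2}
DEPTH = {0: 0, 1: 1, 2: 2, 3: 3}
# Hypothetical lastmatch pointers for two states numbered 10 and 11.
LASTMATCH = {
    (10, 3): 3, (10, 2): 1, (10, 1): 1,
    (11, 3): 2, (11, 2): 2, (11, 1): 1,
}

if __name__ == "__main__":
    print(report_matches({10, 11}, 3, 5, PARENT, DEPTH, LASTMATCH))
```

The heap holds at most one entry per active state, so each pop and push costs $O(\log m)$, in line with the per-match cost stated above.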

7.4.3 Computing Descriptions

Next we show how to compute descriptions efficiently. Let $1 \le \tau \le n$ be a parameter. Initially, compute a
set C of compression elements according to Lemma 43 with parameter $\tau$. For each element $z_j \in C$ we store
$l_j$ and the transition sets $\bar{\delta}(\{s\}, \mathrm{phrase}(z_j))$ for each state s in A. Each transition set uses $O(m)$ space and
therefore the total space used for $z_j$ is $O(m^2)$. During the construction of C we compute each of the transition
sets by following the path of references to the nearest element $y \in C$ and computing state-set transitions
from y to $z_j$. By Lemma 43(ii) the distance to y is at most $2\tau$ and therefore all of the m transition sets can
be computed in $O(\tau m^2)$ time. Since $|C| = O(n/\tau)$ the total preprocessing time is $O(n/\tau \cdot \tau m^2) = O(nm^2)$
and the total space is $O(n/\tau \cdot m^2)$.

The descriptions can now be computed as follows. The integers $l_i$ and $u_i$ can be computed as before
in $O(\tau)$ time. All lastmatch pointers for all compression elements can easily be obtained while computing
the transition sets. Hence, we only show how to compute the state-set values. First, let $S_{u_0} := \{\theta\}$. To
compute $S_{u_i}$ from $S_{u_{i-1}}$ we compute the path p to $z_i$ from the nearest element $y \in C$. Let $p'$ be the path
from $z_0$ to y. Since $\mathrm{phrase}(z_i) = \mathrm{label}(p') \cdot \mathrm{label}(p)$ we can compute $S_{u_i} = \bar{\delta}(S_{u_{i-1}}, \mathrm{phrase}(z_i))$ in two steps
as follows. First compute the set

$$S' = \bigcup_{s \in S_{u_{i-1}}} \bar{\delta}(\{s\}, \mathrm{phrase}(y)). \qquad (7.1)$$

Since $y \in C$ we know the transition sets $\bar{\delta}(\{s\}, \mathrm{phrase}(y))$ and we can therefore compute the union in $O(m^2)$
time. Secondly, we compute $S_{u_i}$ as the set $\bar{\delta}(S', \mathrm{label}(p))$. Since the distance to y is at most $2\tau$ this step uses
$O(\tau m)$ time. Hence, all the state-sets $S_{u_0}, \ldots, S_{u_n}$ can be computed in $O(nm(m + \tau))$ time.
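
The two-step computation can be checked on a small example. The Python sketch below is an illustration under assumptions: the hand-built automaton, the edge-list representation, and the names `bar_step`, `bar_delta`, `transition_sets`, and `next_state_set` are hypothetical, and the phrases are passed as plain strings instead of being read off a dictionary trie. It compares the two-step result with a direct scan of the whole phrase.

```python
def closure(edges, states):
    """States reachable from `states` by epsilon-transitions (label None)."""
    seen = set(states)
    stack = list(seen)
    while stack:
        v = stack.pop()
        for label, w in edges.get(v, ()):
            if label is None and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def bar_step(edges, start, states, ch):
    """One step of the match-finding transition: add the start state, then read `ch`."""
    src = closure(edges, set(states) | {start})
    return closure(edges, {w for v in src for label, w in edges.get(v, ()) if label == ch})

def bar_delta(edges, start, states, text):
    """Apply bar_step for every character of `text`."""
    current = set(states)
    for ch in text:
        current = bar_step(edges, start, current, ch)
    return current

def transition_sets(edges, start, all_states, phrase):
    """Preprocessing for a stored element: one transition set per state."""
    return {s: bar_delta(edges, start, {s}, phrase) for s in all_states}

def next_state_set(edges, start, prev, stored, tail):
    """Two-step update; assumes `prev` is non-empty."""
    s_prime = set().union(*(stored[s] for s in prev))  # the union in (7.1)
    return bar_delta(edges, start, s_prime, tail)      # remaining labels on the path from y

# Hypothetical automaton for a(b|c) with start state 0.
EDGES = {
    0: [("a", 1)],
    1: [(None, 2), (None, 4)],
    2: [("b", 3)],
    4: [("c", 5)],
    3: [(None, 6)],
    5: [(None, 6)],
}

if __name__ == "__main__":
    stored = transition_sets(EDGES, 0, range(7), "ba")   # phrase(y) = "ba"
    prev = {1, 2, 4}
    print(next_state_set(EDGES, 0, prev, stored, "c"))   # label(p) = "c"
    print(bar_delta(EDGES, 0, prev, "bac"))              # direct computation
```

The union step touches at most m stored sets of size at most m, and the second step performs one state-set transition per remaining character, matching the two cost terms in the analysis above.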

7.4.4 Analysis 

Combining it all, we have an algorithm using $O(nm(m + \tau) + occ \cdot m \log m)$ time and $O(nm + nm^2/\tau)$ space.
Note that since we are using $\Omega(n)$ space, hashing is not needed and the algorithm works for ZLW as well. In
summary, this completes the proof of Theorem 18.
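
The bounds can be instantiated for particular parameter values. The following substitution is a worked example only, writing $\tau$ for the parameter introduced in Section 7.4.3; it uses nothing beyond the time and space bounds stated in this section.

```latex
\begin{align*}
\tau = 1:\quad & O(nm(m+1) + occ \cdot m\log m) = O(nm^2 + occ \cdot m\log m) \text{ time},
  && O(nm + nm^2) = O(nm^2) \text{ space},\\
\tau = m:\quad & O(nm(m+m) + occ \cdot m\log m) = O(nm^2 + occ \cdot m\log m) \text{ time},
  && O(nm + nm^2/m) = O(nm) \text{ space}.
\end{align*}
```

Under these bounds, $\tau = m$ keeps the running time of the $\tau = 1$ case while reducing the space by a factor m. Increasing $\tau$ beyond m raises the time term $nm\tau$ without lowering the space below the $O(nm)$ term.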

7.4.5 Exploiting Word-level Parallelism 

If we use the word-parallelism inherent in the word-RAM model, the algorithm of Navarro [Nav03] uses
$O(\lceil m/w \rceil (2^m + nm) + occ \cdot m \log m)$ time and $O(\lceil m/w \rceil (2^m + nm))$ space, where w is the number of bits in
a word of memory and space is counted as the number of words used. The key idea in Navarro's algorithm is
to compactly encode state-sets in bit strings stored in $O(\lceil m/w \rceil)$ words. Using a DFA based on a Glushkov
automaton [Glu61] to quickly compute state-set transitions, and bitwise OR and AND operations to compute
unions and intersections among state-sets, it is possible to obtain the above result. The $O(\lceil m/w \rceil 2^m)$ term
in the above bounds is the time and space used to construct the DFA.

A similar idea can be used to improve Theorem 18. However, since our solution is based on Thompson's
automaton we do not need to construct a DFA. More precisely, using the state-set encoding of TNFAs
given in [Mye92a, BFC05] a state-set transition can be computed in $O(\lceil m/\log n \rceil)$ time after $O(n)$ time
and space preprocessing. Since state-sets are encoded as bit strings each transition set uses $\lceil m/\log n \rceil$
space and the union in (7.1) can be computed in $O(m \lceil m/\log n \rceil)$ time using a bitwise OR operation. As
$n \ge \sqrt{u}$ in ZL78 and ZLW, we have that $\log n \ge \frac{1}{2} \log u$ and therefore Theorem 18 can be improved by
roughly a factor $\log u$. Specifically, we get an algorithm using $O(n \lceil m/\log u \rceil (m + \tau) + occ \cdot m \log m)$ time
and $O(nm \lceil m/\log u \rceil / \tau + nm)$ space.
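
The bitwise-OR union can be illustrated with Python integers used as bit strings. This is only a model of the idea: Python integers have arbitrary precision, so the sketch does not reproduce the word-level encoding of [Mye92a, BFC05], and the names `to_bits`, `from_bits`, and `union_of_transition_sets` together with the example table are assumptions made for illustration.

```python
def to_bits(states):
    """Encode a set of non-negative state numbers as an integer bit string."""
    bits = 0
    for s in states:
        bits |= 1 << s
    return bits

def from_bits(bits):
    """Decode an integer bit string back into a set of state numbers."""
    return {i for i in range(bits.bit_length()) if (bits >> i) & 1}

def union_of_transition_sets(prev_bits, stored_bits):
    """OR together stored_bits[s] for every state s whose bit is set in prev_bits."""
    result = 0
    s = 0
    while prev_bits:
        if prev_bits & 1:
            result |= stored_bits[s]  # one bitwise OR per state in the previous state-set
        prev_bits >>= 1
        s += 1
    return result

# Hypothetical transition sets for states 0..6 of some stored element.
STORED = [
    to_bits({1}), to_bits({1, 2, 4}), to_bits({3, 6}), 0,
    to_bits({5, 6}), 0, 0,
]

if __name__ == "__main__":
    prev = to_bits({1, 2, 4})
    print(sorted(from_bits(union_of_transition_sets(prev, STORED))))
```

With each set packed into a few machine words, every OR in the loop costs time proportional to the number of words rather than to m, which is the source of the improvement described above.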






Bibliography 



[AB92a] Amihood Amir and Gary Benson. Efficient two-dimensional compressed matching. In Pro- 
ceedings of the 2nd Data Compression Conference, pages 279-288, 1992. 

[AB92b] Amihood Amir and Gary Benson. Two-dimensional periodicity and its applications. In Pro- 
ceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 440-452, 
1992. 

[ABF96] Amihood Amir, Gary Benson, and Martin Farach. Let sleeping files lie: pattern matching in 
Z-compressed files. J. Comput. System Sci., 52(2):299-307, 1996. 

[ADKF70] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzev. On economic construction
of the transitive closure of a directed graph (in Russian). Dokl. Acad. Nauk., 194:487-488, 1970.
English translation in Soviet Math. Dokl., 11:1209-1210, 1975.

[AFT06] Tatsuya Akutsu, Daiji Fukagawa, and Atsuhiro Takasu. Approximating tree edit distance 
through string edit distance. In Proceedings of the 17th International Symposium on Algorithms 
and Computation, Lecture Notes in Computer Science, volume 4288, pages 90-99, 2006. 

[AGKR04] Stephen Alstrup, Cyril Gavoille, Haim Kaplan, and Theis Rauhe. Nearest common ancestors: 
A survey and a new algorithm for a distributed environment. Theory Comput. Syst., 37:441- 
456, 2004. 

[AGM+90] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment
search tool. J. Mol. Biol., 215(3):403-410, 1990.

[AH94] T. Akutsu and M. M. Halldorsson. On the approximation of largest common point sets
and largest common subtrees. In Proceedings of the 5th Annual International Symposium on
Algorithms and Computation, Lecture Notes in Computer Science, volume 834, pages 405-413,
1994.

[AH97] Susanne Albers and Torben Hagerup. Improved parallel integer sorting without concurrent
writing. Inform. and Comput., 136:25-51, 1997.

[AHdLT97] Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. Minimizing
diameters of dynamic trees. In Proceedings of the 24th International Colloquium on Automata,
Languages and Programming, Lecture Notes in Computer Science, volume 1256, pages 270-280,
1997.

[AHNR98] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time?
J. Comput. System Sci., 57(1):74-93, 1998.

[AHR98] Stephen Alstrup, Thore Husfeldt, and Theis Rauhe. Marked ancestor problems. In Proceedings
of the 39th Annual IEEE Symposium on Foundations of Computer Science, pages 534-543,
1998.






[AHT00] Stephen Alstrup, Jacob Holm, and Mikkel Thorup. Maintaining center and median in dynamic
trees. In Proceedings of the 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes
in Computer Science, volume 1851, pages 46-56, 2000.

[AHU76] A. V. Aho, D. S. Hirschberg, and J. D. Ullman. Bounds on the complexity of the longest
common subsequence problem. J. ACM, 23(1):1-12, 1976.

[Aku06] Tatsuya Akutsu. A relation between edit distance for ordered trees and edit distance for Euler
strings. Inform. Process. Lett., 100(3):105-109, 2006.

[AKW98] Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. The AWK Programming
Language. Addison-Wesley, 1998.

[ALM+98] S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof verification and the
hardness of approximation problems. J. ACM, 45(3):501-555, 1998.

[ALP04] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with
k mismatches. J. Algorithms, 50(2):257-275, 2004.

[AR02] Stephen Alstrup and Theis Rauhe. Improved labeling schemes for ancestor queries. In
Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 947-953,
2002.

[AS01] Laurent Alonso and R. Schott. On the tree inclusion problem. Acta Inform., 37(9):653-670,
2001.

[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: principles, techniques, and
tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.

[BCGM99] Luc Boasson, Patrick Cegielski, I. Guessarian, and Yuri Matiyasevich. Window-accumulated
subsequence matching problem is linear. In Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, pages 327-336, 1999.

[BFC00] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Proceedings of 
the 4th Latin American Symposium on Theoretical Informatics, pages 88-94, 2000. 

[BFC05] Philip Bille and Martin Farach-Colton. Fast and compact regular expression matching, 2005.
Submitted to a journal. Preprint available at arxiv.org/cs/0509069.

[Bil05] Philip Bille. A survey on tree edit distance and related problems. Theoret. Comput. Sci.,
337(1-3):217-239, 2005.

[Bil06] Philip Bille. New algorithms for regular expression matching. In Proceedings of the 33rd
International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer
Science, volume 4051, pages 643-654, 2006.

[BML+04] Denilson Barbosa, Alberto O. Mendelzon, Leonid Libkin, Laurent Mignet, and Marcelo Arenas.
Efficient incremental validation of XML documents. In Proceedings of the 20th International
Conference on Data Engineering, page 671, 2004.

[BY89] Ricardo A. Baeza-Yates. Efficient Text Searching. PhD thesis, Dept. of Computer Science,
University of Waterloo, 1989.

[BY91] Ricardo A. Baeza-Yates. Searching subsequences. Theoret. Comput. Sci., 78(2):363-376, 1991.

[BYG92] Ricardo Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Commun.
ACM, 35(10):74-82, 1992.






[BYN96] Ricardo A. Baeza-Yates and Gonzalo Navarro. A faster algorithm for approximate string
matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching,
Lecture Notes in Computer Science, volume 1075, pages 1-23, 1996.

[CD99] J. Clark and S. DeRose. XML path language (XPath), available as
http://www.w3.org/TR/xpath, 1999.

[CGM97] S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In
Proceedings of the ACM SIGMOD International Conference on Management of Data, pages
26-37, 1997.

[CH02] Richard Cole and Ramesh Hariharan. Approximate string matching: A simpler faster
algorithm. SIAM J. Comput., 31(6):1761-1782, 2002.

[Cha06] Timothy M. Chan. All-pairs shortest paths for unweighted undirected graphs in $O(mn)$ time.
In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages
514-523, 2006.

[Cha07] Timothy M. Chan. More algorithms for all-pairs shortest paths in weighted graphs. In
Proceedings of the 39th Annual ACM Symposium on Theory of Computing, 2007. To appear.

[Che98] Weimin Chen. More efficient algorithm for ordered tree inclusion. J. Algorithms, 26:370-385,
1998.

[Che00] Weimin Chen. Multi-subsequence searching. Inform. Process. Lett., 74(5-6):229-233, 2000.

[Che01] Weimin Chen. New algorithm for ordered tree-to-tree correction problem. J. Algorithms,
40:135-158, 2001.

[CHI99] Richard Cole, Ramesh Hariharan, and Piotr Indyk. Tree pattern matching and subset
matching in deterministic $O(n \log^3 n)$-time. In Proceedings of the 10th Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 245-254, 1999.

[Chu87] M. J. Chung. $O(n^{2.5})$ algorithm for the subgraph homeomorphism problem on trees. J.
Algorithms, 8(1):106-112, 1987.

[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction
to Algorithms, second edition. MIT Press, 2001.

[CLZU03] Maxime Crochemore, Gad M. Landau, and Michal Ziv-Ukelson. A subquadratic sequence
alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32(6):1654-1673,
2003.

[CM07] Graham Cormode and S. Muthukrishnan. The string edit distance matching problem with
moves. ACM Trans. Algorithms, 3(1):2, 2007.

[CMT03] Maxime Crochemore, Borivoj Melichar, and Zdenek Tronicek. Directed acyclic subsequence
graph: Overview. J. Discrete Algorithms, 1(3-4):255-280, 2003.

[CR72] Stephen A. Cook and Robert A. Reckhow. Time-bounded random access machines. In
Proceedings of the 4th Annual ACM Symposium on Theory of Computing, pages 73-80, 1972.

[CRGMW96] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in
hierarchically structured information. In Proceedings of the ACM SIGMOD International Conference
on Management of Data, pages 493-504, 1996.

[DDHS00] Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung, and Hunter Scales. AltiVec extension
to PowerPC accelerates media processing. IEEE Micro, 20(2):85-95, 2000.






[DFG+97] Gautam Das, Rudolf Fleischer, Leszek Gasieniec, Dimitrios Gunopulos, and Juha Karkkainen.
Episode matching. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern
Matching, Lecture Notes in Computer Science, volume 1264, pages 12-27, 1997.

[DGM90] Moshe Dubiner, Zvi Galil, and Edith Magen. Faster tree pattern matching. In Proceedings of 
the 31st Annual IEEE Symposium on the Foundations of Computer Science, pages 145-150, 
1990. 

[Die89] P. F. Dietz. Fully persistent arrays (extended abstract). In Proceedings of the Workshop on
Algorithms and Data Structures, Lecture Notes in Computer Science, volume 382, pages 67-74,
1989.

[DKM+94] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm Meyer auf der Heide, Hans
Rohnert, and Robert Tarjan. Dynamic perfect hashing: Upper and lower bounds. SIAM J.
Comput., 23(4):738-761, 1994.

[DMRW06] Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An $O(n^3)$-time
algorithm for tree edit distance. Arxiv preprint cs.DS/0604037, April 2006.

[DMRW07] Erik Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposi- 
tion algorithm for tree edit distance. In Proceedings of the 34th International Colloquium on 
Automata, Languages and Programming, 2007. 

[DT03] Serge Dulucq and Laurent Tichit. RNA secondary structure comparison: exact analysis of the
Zhang-Shasha tree edit algorithm. Theoret. Comput. Sci., 306(1-3):471-484, 2003.

[DT05] Serge Dulucq and Helene Touzet. Decomposition algorithms for the tree edit distance problem.
J. Discrete Algorithms, 3(2-4):448-471, 2005.

[EGG88] David Eppstein, Zvi Galil, and Raffaele Giancarlo. Speeding up dynamic programming. In
Proceedings of the 29th Annual IEEE Symposium on Foundations of Computer Science, pages
488-496, 1988.

[EGGI92] David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano. Sparse dynamic
programming I: Linear cost functions. J. ACM, 39(3):519-545, 1992.

[FM96] P. Ferragina and S. Muthukrishnan. Efficient dynamic method-lookup for object oriented
languages. In Proceedings of the 4th Annual European Symposium on Algorithms, Lecture
Notes in Computer Science, volume 1136, pages 107-120, 1996.

[Fre97] Greg N. Frederickson. Ambivalent data structures for dynamic 2-edge-connectivity and k
smallest spanning trees. SIAM J. Comput., 26(2):484-538, 1997.

[FT94] Martin Farach and Mikkel Thorup. Fast comparison of evolutionary trees. In Proceedings of
the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 481-488, 1994.

[FT98] Martin Farach and Mikkel Thorup. String matching in Lempel-Ziv compressed strings.
Algorithmica, 20(4):388-404, 1998.

[FW93] Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with
fusion trees. J. Comput. System Sci., 47(3):424-436, 1993.

[FW94] Michael L. Fredman and Dan E. Willard. Trans-dichotomous algorithms for minimum spanning
trees and shortest paths. J. Comput. System Sci., 48(3):533-551, 1994.

[GJ79] Michael J. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory
of NP-completeness. Freeman, 1979.






[GK05] Minos Garofalakis and Amit Kumar. XML stream processing using tree-edit distance
embeddings. ACM Trans. Database Syst., 30(1):279-332, 2005.

[Glu61] Victor M. Glushkov. The abstract theory of automata. Russian Math. Surveys, 16(5):1-53,
1961.

[GN98] Arvind Gupta and Naomi Nishimura. Finding largest subtrees and smallest supertrees.
Algorithmica, 21:183-210, 1998.

[Got82] O. Gotoh. An improved algorithm for matching biological sequences. J. Molecular Biology,
162(3):705-708, 1982.

[GT83] Harold N. Gabow and Robert Endre Tarjan. A linear-time algorithm for a special case of
disjoint set union. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing,
pages 246-251, 1983.

[Gus97] Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational
biology. Cambridge, 1997.

[Hag98] Torben Hagerup. Sorting and searching on the word RAM. In Proceedings of the 15th Annual
Symposium on Theoretical Aspects of Computer Science, Lecture Notes in Computer Science,
volume 1373, pages 366-398, 1998.

[Han04] Yijie Han. Improved algorithm for all pairs shortest paths. Inform. Process. Lett., 91(5):245-250,
2004.

[Han06] Yijie Han. An $O(n^3 (\log\log n/\log n)^{5/4})$ time algorithm for all pairs shortest paths. In
Proceedings of the 14th Annual European Symposium on Algorithms, Lecture Notes in Computer
Science, volume 4168, pages 411-417, 2006.

[Hir75] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences.
Commun. ACM, 18(6):341-343, 1975.

[HMP01] Torben Hagerup, Peter Bro Miltersen, and Rasmus Pagh. Deterministic dictionaries. J.
Algorithms, 41(1):69-85, 2001.

[HN05] Heikki Hyyro and Gonzalo Navarro. Bit-parallel witnesses and their applications to
approximate string matching. Algorithmica, 41(3):203-231, 2005.

[HO82] Christoph M. Hoffmann and Michael J. O'Donnell. Pattern matching in trees. J. ACM,
29(1):68-95, 1982.

[HP01] Haruo Hosoya and Benjamin Pierce. Regular expression pattern matching for XML. In
Proceedings of the 28th ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, pages 67-80, 2001.

[HT84] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J.
Comput., 13(2):338-355, 1984.

[HT02] Yijie Han and Mikkel Thorup. Integer sorting in $O(n\sqrt{\log\log n})$ expected time and linear
space. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer
Science, pages 135-144, 2002.

[HTGK03] Matthias Hochsmann, Thomas Toller, Robert Giegerich, and Stefan Kurtz. Local similarity
in RNA secondary structures. In Proceedings of the IEEE Computer Society Conference on
Bioinformatics, pages 159-168, 2003.






[ISY03] Lucian Ilie, Baozhen Shan, and Sheng Yu. Fast algorithms for extended regular expression
matching and searching. In Proceedings of the 20th Annual Symposium on Theoretical Aspects
of Computer Science, Lecture Notes in Computer Science, volume 2607, pages 179-190,
2003.

[JHS06] Jesper Jansson, Ngo Trung Hieu, and Wing-Kin Sung. Local gapped subforest alignment and
its application in finding RNA structural motifs. J. Comp. Biology, 13(3):702-718, 2006.

[JL01] Jesper Jansson and Andrzej Lingas. A fast algorithm for optimal alignment between similar
ordered trees. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern
Matching, Lecture Notes in Computer Science, volume 2089, 2001.

[JL03] Jesper Jansson and Andrzej Lingas. A fast algorithm for optimal alignment between similar
ordered trees. Fundam. Inform., 56(1-2):105-120, 2003.

[Jor69] C. Jordan. Sur les assemblages des lignes. J. Reine Angew. Math., 70:185-190, 1869.

[JP06] Jesper Jansson and Zeshan Peng. Algorithms for finding a most similar subforest. In
Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in
Computer Science, volume 4009, pages 377-388, 2006.

[JWZ95] Tao Jiang, Lusheng Wang, and Kaizhong Zhang. Alignment of trees - an alternative to tree
edit. Theoret. Comput. Sci., 143(1):137-148, 1995.

[KA94] Dmitry Keselman and Amihood Amir. Maximum agreement subtree in a set of evolutionary
trees - metrics and efficient algorithms. In Proceedings of the 35th Annual IEEE Symposium
on Foundations of Computer Science, pages 758-769, 1994.

[Kil92] Pekka Kilpelainen. Tree Matching Problems with Applications to Structured Text Databases.
PhD thesis, University of Helsinki, Department of Computer Science, November 1992.

[Kle98] P. N. Klein. Computing the edit-distance between unrooted ordered trees. In Proceedings of the
6th Annual European Symposium on Algorithms, Lecture Notes in Computer Science, volume
1461, pages 91-102, 1998.

[Kle02] Philip Klein, 2002. Personal communication.

[KM93] Pekka Kilpelainen and Heikki Mannila. Retrieval from hierarchical texts by partial patterns.
In Proceedings of the 16th Conference on Research and Development in Information Retrieval,
pages 214-222, 1993.

[KM95a] Pekka Kilpelainen and Heikki Mannila. Ordered and unordered tree inclusion. SIAM J.
Comput., 24:340-356, 1995.

[KM95b] James R. Knight and Eugene W. Myers. Super-pattern matching. Algorithmica, 13(1/2):211- 
243, 1995. 

[KM95c] James Robert Knight and Eugene W. Myers. Approximate regular expression pattern match- 
ing with concave gap penalties. Algorithmica, 14:85-121, 1995. 

[KMY95] S. Khanna, R. Motwani, and F. F. Yao. Approximation algorithms for the largest common 
subtree problem. Technical report, Stanford University, 1995. 

[Knu69] Donald Erwin Knuth. The Art of Computer Programming, Volume 1. Addison Wesley, 1969. 

[KNU03] Juha Karkkainen, Gonzalo Navarro, and Esko Ukkonen. Approximate string matching on
Ziv-Lempel compressed text. J. Discrete Algorithms, 1(3-4):313-338, 2003.






[Kos89] S. Rao Kosaraju. Efficient tree pattern matching. In Proceedings of the 30th Annual IEEE
Symposium on the Foundations of Computer Science, pages 178-183, 1989.

[KTS+98] Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Masamichi Miyazaki, and Setsuo Arikawa.
Multiple pattern matching in LZW compressed text. In Proceedings of the 8th Data
Compression Conference, pages 103-112, 1998.

[KTSK00] Philip Klein, Srikanta Tirthapura, Daniel Sharvit, and Ben Kimia. A tree-edit-distance
algorithm for comparing simple, closed shapes. In Proceedings of the 11th Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 696-704, 2000.

[LM01] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path
expressions. In Proceedings of the 27th International Conference on Very Large Data Bases, pages
361-370, 2001.

[LMS98] Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. 
SIAM J. Comput., 27(2):557-582, 1998. 

[LST01] Chin Lung Lu, Zheng-Yao Su, and Chuan Yi Tang. A new measure of edit distance between
labeled trees. In Proceedings of the 7th Annual International Computing and Combinatorics
Conference, Lecture Notes in Computer Science, volume 2108, pages 338-348, 2001.

[Lu79] S. Y. Lu. A tree-to-tree distance and its application to cluster analysis. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 1:219-224, 1979.

[Lu84] S. Y. Lu. A tree-matching algorithm based on node splitting and merging. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 6(2):249-256, 1984.

[LV89] G. M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. J.
Algorithms, 10:157-169, 1989.

[MKT+00] Tetsuya Matsumoto, Takuya Kida, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa.
Bit-parallel approach to approximate string matching in compressed texts. In Proceedings
of the 7th International Symposium on String Processing and Information Retrieval, pages
221-228, 2000.

[MM88] Webb Miller and Eugene W. Myers. Sequence comparison with concave weighting functions.
Bull. of Math. Biology, 50(2):97-120, 1988.

[MM89] E. W. Myers and W. Miller. Approximate matching of regular expressions. Bull. of Math.
Biology, 51:5-37, 1989.

[MM96] S. Muthukrishnan and Martin Müller. Time and space efficient method-lookup for
object-oriented programs. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete
Algorithms, pages 42-51, 1996.

[MN90] K. Mehlhorn and S. Näher. Bounded ordered dictionaries in $O(\log\log N)$ time and $O(n)$
space. Inform. Process. Lett., 35(4):183-189, 1990.

[MNU05] Veli Makinen, Gonzalo Navarro, and Esko Ukkonen. Transposition invariant string matching.
J. Algorithms, 56(2):124-153, 2005.

[MOG98] Eugene W. Myers, Paulo Oliva, and Katia S. Guimaraes. Reporting exact and approximate
regular expression matches. In Proceedings of the 9th Annual Symposium on Combinatorial
Pattern Matching, Lecture Notes in Computer Science, volume 1448, pages 91-103, 1998.

[Mot92] Rajeev Motwani. Lecture notes on approximation algorithms volume 1. Technical Report
STAN-CS-92-1435, Stanford University, Department of Computer Science, 1992.






[MP80] W. Masek and M. Paterson. A faster algorithm for computing string edit distances. J. Comput.
System Sci., 20:18-31, 1980.

[MR90] Heikki Mannila and K. J. Raiha. On query languages for the p-string data model. Information
Modelling and Knowledge Bases, pages 469-482, 1990.

[MT92] Jiri Matousek and R. Thomas. On the complexity of finding iso- and other morphisms for
partial k-trees. Discrete Math., 108:343-364, 1992.

[MUN03] Veli Makinen, Esko Ukkonen, and Gonzalo Navarro. Approximate matching of run-length
compressed strings. Algorithmica, 35(4):347-369, 2003.

[Mur01] Makoto Murata. Extended path expressions of XML. In Proceedings of the 20th ACM
Symposium on Principles of Database Systems, pages 126-137, 2001.

[MY60] R. McNaughton and H. Yamada. Regular expressions and state graphs for automata. IRE
Trans. on Electronic Computers, 9(1):39-47, 1960.

[Mye86] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica,
1(2):251-266, 1986.

[Mye91] Eugene W. Myers. An overview of sequence comparison algorithms in molecular biology.
Technical Report 91-29, Univ. of Arizona, Dept. of Computer Science, 1991.

[Mye92a] E. W. Myers. A four-Russian algorithm for regular expression pattern matching. J. ACM,
39(2):430-448, 1992.

[Mye92b] Eugene W. Myers. Approximate matching of network expressions with spacers. J. of
Computational Biology, 3(1):33-51, 1992.

[Mye99] Gene Myers. A fast bit-vector algorithm for approximate string matching based on dynamic
programming. J. ACM, 46(3):395-415, 1999.

[Nav01a] Gonzalo Navarro. A guided tour to approximate string matching. ACM Comput. Surv.,
33(1):31-88, 2001.

[Nav01b] Gonzalo Navarro. NR-grep: a fast and flexible pattern-matching tool. Software - Practice and
Experience, 31(13):1265-1312, 2001.

[Nav03] Gonzalo Navarro. Regular expression searching on compressed text. J. Discrete Algorithms,
1(5-6):423-443, 2003.

[Nav04] Gonzalo Navarro. Approximate regular expression searching with arbitrary integer weights.
Nordic J. Comput., 11(4):356-373, 2004.

[NKT+01] Gonzalo Navarro, Takuya Kida, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa.
Faster approximate string matching over compressed text. In Proceedings of the 11th Data
Compression Conference, page 459, Washington, DC, USA, 2001. IEEE Computer Society.

[NR98] G. Navarro and M. Raffinot. A general practical approach to pattern matching over
Ziv-Lempel compressed text. Technical Report TR/DCC-98-12, Dept. of Computer Science, Univ.
of Chile, 1998.

[NR03] Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps
pattern matching, with applications to protein searching. J. Comp. Biology, 10(6):903-923,
2003.






[NR04] Gonzalo Navarro and Mathieu Raffinot. New techniques for regular expression searching. 

Algorithmica, 41(2):89-116, 2004. 

[NRT00] Naomi Nishimura, Prabhakar Ragde, and Dimitrios M. Thilikos. Finding smallest supertrees 
under minor containment. Int. J. Found. Comput. Sci., ll(3):445-465, 2000. 

[OFW99] Stuart Oberman, Greg Favor, and Fred Weber. AMD 3DNow! technology: Architecture and 
implementations. IEEE Micro, 19(2):37-48, 1999. 

[PT06] Mihai Patrascu and Mikkel Thorup. Time-space trade-offs for predecessor search. In Proceed- 

ings of the 38th Annual ACM Symposium on Theory of Computing, pages 232-240, 2006. 

[PWW97] Alex Peleg, Sam Wilkic, and Uri Weiser. Intel MMX for multimedia PCs. Commun. ACM, 
40(l):24-38, 1997. 

[Ric97a] Thorstcn Richtcr. A new algorithm for the ordered tree inclusion problem. In Proceedings 
of the 8th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes of Computer 
Science, volume 1264, pages 150-166, 1997. 

[Ric97b] Thorsten Richter. A new measure of the distance between ordered trees and its applications, 
technical report 85166-cs. Technical report, Department of Computer Science, University of 
Bonn, 1997. 

[RR92] R. Ramcsh and I.V. Ramakrishnan. Nonlinear pattern matching in trees. J. ACM, 39(2) :295- 

316, 1992. 

[Ruz04] Milan Ruzic. Algorithms for deterministic construction of efficient dictionaries. In Proceedings 

of the 12th Annual European Symposium on Algorithms, Lecture Notes in Computer Science, 
pages 592-603, 2004. 

[Ryt99] Wojciech Rytter. Algorithms on compressed strings and arrays. In Proceedings of the 26th 

Conference on Current Trends in Theory and Practice of Informatics on Theory and Practice 
of Informatics, Lecture Notes in Computer Science, volume 1725, pages 48-65, 1999. 

[Sel77] Stanley M. Sclkow. The tree-to-tree editing problem. Inform. Process. Lett., 6(6):184-186, 

1977. ' 

[Sel80] P. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. 

Algorithms, 1:359-373, 1980. 

[SM97] J. Sctubal and J. Meidanis. Introduction to Computational Biology. PWS Publishing Company, 

1997. 

[SM02] Torsten Schlieder and Holger Meuss. Querying and ranking XML documents. J. Am. Soc. 

Inf. Sci. Technol, 53(6):489-503, 2002. 

[SN00] T. Schlieder and F. Naumann. Approximate tree embedding for querying XML data. In ACM
SIGIR Workshop On XML and Information Retrieval, 2000.

[ST99] R. Shamir and D. Tsur. Faster subtree isomorphism. J. Algorithms, 33:267-280, 1999. 

[Sta81] Richard M. Stallman. Emacs the extensible, customizable self-documenting display editor.
SIGPLAN Not., 16(6):147-156, 1981.

[SWSZ02] Dennis Shasha, Jason Tsong-Li Wang, Huiyuan Shan, and Kaizhong Zhang. Atreegrep:
Approximate searching in unordered trees. In Proceedings of the 14th International Conference
on Scientific and Statistical Database Management, pages 89-98, 2002.




[SZ90] Dennis Shasha and Kaizhong Zhang. Fast algorithms for the unit cost editing distance between
trees. J. Algorithms, 11:581-621, 1990.

[SZ97] Dennis Shasha and Kaizhong Zhang. Approximate tree pattern matching. In Pattern Matching
in String, Trees and Arrays, pages 341-371. Oxford University, 1997.

[Tai79] Kuo-Chung Tai. The tree-to-tree correction problem. J. ACM, 26:422-433, 1979. 

[Tak04] T. Takaoka. A faster algorithm for the all-pairs shortest path problem and its application.
In Proceedings of the 10th Annual International Computing and Combinatorics Conference,
Lecture Notes in Computer Science, volume 3106, pages 278-289, 2004.

[Tan95] Eiichi Tanaka. A note on a tree-to-tree editing problem. International Journal of Pattern
Recognition and Artificial Intelligence, 9(1):167-172, 1995.

[TH99] Shreekant (Ticky) Thakkar and Tom Huff. Internet streaming SIMD extensions. Computer,
32(12):26-34, 1999.

[Tho68] K. Thompson. Regular expression search algorithm. Commun. ACM, 11:419-422, 1968. 

[Tho99] Mikkel Thorup. Undirected single-source shortest paths with positive integer weights in linear
time. J. ACM, 46(3):362-394, 1999.

[Tho03] Mikkel Thorup. Space efficient dynamic stabbing with fast queries. In Proceedings of the 35th
Annual ACM Symposium on Theory of Computing, pages 649-658, 2003.

[TONH96] Marc Tremblay, J. Michael O'Connor, Venkatesh Narayanan, and Liang He. VIS speeds new
media processing. IEEE Micro, 16(4):10-20, 1996.

[Tou03] Helene Touzet. Tree edit distance with gaps. Inform. Process. Lett., 85(3):123-129, 2003.

[Tou05] Helene Touzet. A linear tree edit distance algorithm for similar ordered trees. In Proceedings of
the 16th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer
Science, volume 3537, pages 334-345, 2005.

[Tro01] Zdenek Tronicek. Searching subsequences. Ph.D. Thesis, Department of Computer Science
and Engineering, FEE CTU in Prague, 2001.

[TRS02] A. Termier, M. Rousset, and M. Sebag. Treefinder: a first step towards XML data mining. In 
Proceedings of the 2nd International Conference on Data Mining, page 450, 2002. 

[TSKK98] Srikanta Tirthapura, Daniel Sharvit, Philip Klein, and Benjamin B. Kimia. Indexing based
on edit-distance matching of shape graphs. In Proceedings of SPIE International Symposium
on Voice, Video and Data Communications, pages 91-102, 1998.

[TT88] Eiichi Tanaka and Keiko Tanaka. The tree-to-tree editing problem. International Journal of
Pattern Recognition and Artificial Intelligence, 2(2):221-240, 1988.

[Ukk85a] Esko Ukkonen. Algorithms for approximate string matching. Inf. Control, 64(1-3):100-118,
1985.

[Ukk85b] Esko Ukkonen. Finding approximate patterns in strings. J. Algorithms, 6:132-137, 1985. 

[Val05] Gabriel Valiente. Constrained tree inclusion. J. Discrete Algorithms, 3(2-4):431-447, 2005.

[vEB77] Peter van Emde Boas. Preserving order in a forest in less than logarithmic time and linear 
space. Inform. Process. Lett., 6(3):80-82, 1977. 




[vEBKZ77] Peter van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient 
priority queue. Mathematical Systems Theory, 10:99-127, 1977. 

[Wal94] Larry Wall. The Perl Programming Language. Prentice Hall Software Series, 1994. 

[Wat95] M.S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995. 

[Wel84] Terry A. Welch. A technique for high-performance data compression. IEEE Computer,
17(6):8-19, 1984.

[WF74] Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM,
21:168-173, 1974.

[WM92a] S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool. In Proceedings 
USENIX Winter 1992 Technical Conference, pages 153-162, 1992. 

[WM92b] Sun Wu and Udi Manber. Fast text searching: allowing errors. Commun. ACM, 35(10):83-91, 
1992. 

[WMM95] S. Wu, U. Manber, and E. W. Myers. A subquadratic algorithm for approximate regular 
expression matching. J. Algorithms, 19(3):346-360, 1995. 

[Wri94] Alden H. Wright. Approximate string matching using within-word parallelism. Softw. Pract.
Exper., 24(4):337-362, 1994.

[WZ03] Lusheng Wang and Jianyun Zhao. Parametric alignment of ordered trees. Bioinformatics,
19(17):2237-2245, 2003.

[WZ05] Lusheng Wang and Kaizhong Zhang. Space efficient algorithms for ordered tree comparison.
In Proceedings of the 16th Annual International Symposium on Algorithms and Computation,
Lecture Notes in Computer Science, volume 3827, pages 380-391, 2005.

[WZJS94] Jason Tsong-Li Wang, Kaizhong Zhang, Karpjoo Jeong, and Dennis Shasha. A system for
approximate tree matching. IEEE Transactions on Knowledge and Data Engineering,
6(4):559-571, 1994.

[Yam01] Hiroaki Yamamoto. A new recognition algorithm for extended regular expressions. In
Proceedings of the 12th International Symposium on Algorithms and Computation, Lecture Notes
in Computer Science, volume 2223, pages 257-267, 2001.

[YLH03] Liang Huai Yang, Mong Li Lee, and Wynne Hsu. Efficient mining of XML query patterns for 
caching. In Proceedings of the 29th Conference on Very Large Data Bases, pages 69-80, 2003. 

[YLH04] Liang Huai Yang, Mong Li Lee, and Wynne Hsu. Finding hot query patterns over an XQuery
stream. The VLDB Journal, 13(4):318-332, 2004.

[YM03] Hiroaki Yamamoto and Takashi Miyazaki. A fast bit-parallel algorithm for matching extended
regular expressions. In Proceedings of the 9th Annual International Computing and
Combinatorics Conference, Lecture Notes in Computer Science, volume 2697, pages 222-231, 2003.

[ZADR03] P. Zezula, G. Amato, F. Debole, and F. Rabitti. Tree signatures for XML querying and 
navigation. In Proceedings of the 1st International XML Database Symposium, pages 149-163, 
2003. 

[Zha89] Kaizhong Zhang. The Editing Distance Between Trees: Algorithms and Applications. PhD
thesis, Courant Institute, Department of Computer Science, 1989.




[Zha95] Kaizhong Zhang. Algorithms for the constrained editing problem between ordered labeled
trees and related problems. Pattern Recognition, 28:463-474, 1995.

[Zha96a] Kaizhong Zhang. A constrained edit distance between unordered labeled trees. Algorithmica, 
15(3):205-222, 1996. 

[Zha96b] Kaizhong Zhang. Efficient parallel algorithms for tree editing problems. In Proceedings of the
7th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science,
volume 1075, pages 361-372, 1996.

[ZJ94] Kaizhong Zhang and Tao Jiang. Some MAX SNP-hard results concerning unordered labeled
trees. Inform. Process. Lett., 49:249-254, 1994.

[ZL77] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE
Trans. Inform. Theory, 23(3):337-343, 1977.

[ZL78] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding.
IEEE Trans. Inform. Theory, 24(5):530-536, 1978.

[ZS89] Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between
trees and related problems. SIAM J. Comput., 18:1245-1262, 1989.

[ZSS91] Kaizhong Zhang, Rick Statman, and Dennis Shasha. On the editing distance between
unordered labeled trees. Technical Report 289, The University of Western Ontario, Department
of Computer Science, 1991.

[ZSS92] Kaizhong Zhang, Rick Statman, and Dennis Shasha. On the editing distance between
unordered labeled trees. Inform. Process. Lett., 42:133-139, 1992.

[ZSW94] Kaizhong Zhang, Dennis Shasha, and Jason T. L. Wang. Approximate tree matching in the
presence of variable length don't cares. J. Algorithms, 16(1):33-66, 1994.

[Zwi04] U. Zwick. A slightly improved sub-cubic algorithm for the all pairs shortest paths problem
with real edge lengths. In Proceedings of the 15th International Symposium on Algorithms and
Computation, Lecture Notes in Computer Science, volume 3341, pages 921-932, 2004.

[ZWS96] Kaizhong Zhang, Jason Tsong-Li Wang, and Dennis Shasha. On the editing distance between
undirected acyclic graphs. Int. J. Found. Comput. Sci., 7(1):43-58, 1996.



